VIET NAM NATIONAL UNIVERSITY, HANOI 
COLLEGE OF TECHNOLOGY 
 
 
 
 
 
 
 
NGUYEN CAM TU 
 
 
 
 
 
 
HIDDEN TOPIC DISCOVERY TOWARD 
CLASSIFICATION AND CLUSTERING IN 
VIETNAMESE WEB DOCUMENTS 
 
 
 
 
 
 
 
 
 
MASTER THESIS 
 
          HANOI - 2008  
VIET NAM NATIONAL UNIVERSITY, HANOI 
COLLEGE OF TECHNOLOGY       
NGUYEN CAM TU      
HIDDEN TOPIC DISCOVERY TOWARD 
CLASSIFICATION AND CLUSTERING IN 
VIETNAMESE WEB DOCUMENTS  
   Major: Information Technology 
Specificity: Information Systems 
Code: 60 48 05    
MASTER THESIS     
SUPERVISOR: Prof. Dr. Ha Quang Thuy       
HANOI - 2008 
i 
Acknowledgements 
My deepest thank must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who 
offers me an endless inspiration in scientific research, leading me to this research area. I 
particularly appreciate his unconditional support and advice in both academic 
environment and daily life during the last four years. 
Many thanks go to Dr. Phan Xuan Hieu who has given me many advices and comments. 
This work can not be possible without his support. Also, I would like to thank him for 
being my friend, my older brother who has brought me a lot of lessons in both scientific 
research and daily life. 
My thanks also go to all members of seminar group “data mining”. Especially, I would 
like to thank Bsc. Nguyen Thu Trang for helping me a lot in collecting data and doing 
experiments. 
I highly acknowledge the invaluable support and advice in both technical and daily life of 
my teachers, my colleagues in Department of Information Systems, Faculty of 
Technology, Vietnam National University, Hanoi 
I also want to thank the supports from the Project QC.06.07 “Vietnamese Named Entity 
Resolution and Tracking crossover Web Documents”, Vietnam National University, 
Hanoi; the Project 203906 “`Information Extraction Models for finding Entities and 
Semantic Relations in Vietnamese Web Pages'' of the Ministry of Science and 
Technology, Vietnam; and the National Project 02/2006/HĐ - ĐTCT-KC.01/06-10 
“Developing content filter systems to support management and implementation public 
security – ensure policy” 
Finally, from bottom of my heart, I would specially like to say thanks to all members in 
my family, all my friends. They are really an endless encouragement in my life.  
Nguyen Cam Tu 
ii 
Assurance 
I certify that the achievements in this thesis belong to my personal, and are not copied 
from any other’s results. Throughout the dissertation, all the mentions are either my 
proposal, or summarized from many sources. All the references have clear origins, and 
properly quoted. I am responsible for this statement. 
 Hanoi, November 15, 2007 
 Nguyen Cam Tu 
iii 
Table of Content 
Introduction 1 
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis 3 
1.1. Introduction 3 
1.2. The Early Methods 5 
1.2.1. Latent Semantic Analysis 5 
1.2.2. Probabilistic Latent Semantic Analysis 8 
1.3. Latent Dirichlet Allocation 11 
1.3.1. Generative Model in LDA 12 
1.3.2. Likelihood 13 
1.3.3. Parameter Estimation and Inference via Gibbs Sampling 14 
1.3.4. Applications 17 
1.4. Summary 17 
Chapter 2. Frameworks of Learning with Hidden Topics 19 
2.1. Learning with External Resources: Related Works 19 
2.2. General Learning Frameworks 20 
2.2.1. Frameworks for Learning with Hidden Topics 20 
2.2.2. Large-Scale Web Collections as Universal Dataset 22 
2.3. Advantages of the Frameworks 23 
2.4. Summary 23 
Chapter 3. Topics Analysis of Large-Scale Web Dataset 24 
3.1. Some Characteristics of Vietnamese 24 
3.1.1. Sound 24 
3.1.2. Syllable Structure 26 
3.1.3. Vietnamese Word 26 
3.2. Preprocessing and Transformation 27 
3.2.1. Sentence Segmentation 27 
iv 
3.2.2. Sentence Tokenization 28 
3.2.3. Word Segmentation 28 
3.2.4. Filters 28 
3.2.5. Remove Non Topic-Oriented Words 28 
3.3. Topic Analysis for VnExpress Dataset 29 
3.4. Topic Analysis for Vietnamese Wikipedia Dataset 30 
3.5. Discussion 31 
3.6. Summary 32 
Chapter 4. Deployments of General Frameworks 33 
4.1. Classification with Hidden Topics 33 
4.1.1. Classification Method 33 
4.1.2. Experiments 36 
4.2. Clustering with Hidden Topics 40 
4.2.1. Clustering Method 40 
4.2.2. Experiments 45 
4.3. Summary 49 
Conclusion 50 
Achievements throughout the thesis 50 
Future Works 50 
References 52 
Vietnamese References 52 
English References 52 
Appendix: Some Clustering Results 56  
v 
List of Figures 
Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and 
symmetric (b) parameterization. ( [55]) 9 
Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [55]) 10 
Figure 1.3. Graphical model representation of LDA - The boxes is “plates” representing 
replicates. The outer plate represents documents, while the inner plate represents the 
repeated choice of topics and words within a document [20] 12
 Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult 
stand for Dirichlet, Poisson, Multinomial distributions respectively 13 
Figure 1.5. Quantities in the model of latent Dirichlet allocation 13 
Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation 16 
Figure 2.1. Classification with Hidden Topics 20 
Figure 2.2. Clustering with Hidden Topics 21 
Figure 3.1. Pipeline of Data Preprocessing and Transformation 27 
Figure 4.1. Classification with VnExpress topics 33 
Figure 4.2 Combination of one snippet with its topics: an example 35 
Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline 
(without topics) 37 
Figure 4.4. Test-out-of train with increasing numbers of training examples. Here, the 
number of topics is set at 60topics 37 
Figure 4.5 F1-Measure for classes and average (over all classes) in learning with 60 
topics 39 
Figure 4.6. Clustering with Hidden Topics 40 
Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering 42 
Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query 47 
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query 47 
vi 
List of Tables 
Table 3.1. Vowels in Vietnamese 24 
Table 3.2. Tones in Vietnamese 25 
Table 3.3. Consonants of hanoi variety 26 
Table 3.4. Structure of Vietnamese syllables 26 
Table 3.5. Functional words in Vietnamese 29 
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset 29 
Table 3.7. Statistics of VnExpress dataset 30 
Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100 
topics 30 
Table 3.9. Statistic of Vietnamese Wikipedia Dataset 31 
Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200 
topics 31 
Table 4.1 Google search results as training and testing dataset. The search phrases for 
training and test data are designed to be exclusive 34 
Table 4.2. Experimental results of baseline (learning without topics) 38 
Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset 38 
Table 4.4. Some collocations with highest values of chi-square statistic 44 
Table 4.5. Queries submitted to Google 45 
Table 4.6. Parameters for clustering web search results 46 
vii 
Notations & Abbreviations  
Word or phrase Abbreviation 
Information Retrieval IR 
Latent Semantic Analysis LSA 
Probability Latent Semantic Analysis PLSA 
Latent Dirichlet Allocation LDA 
Dynamic Topic Models DTM 
Correlated Topic Models CTM 
Singular Value Decomposition SVD 
1 
Introduction 
The World Wide Web has influenced many aspects of our lives, changing the way we 
communicate, conduct business, shop, entertain, and so on. However, a large portion of 
the Web data is not organized in systematic and well structured forms, a situation which 
causes great challenges to those seeking for information on the Web. Consequently, a lot 
of tasks enabling users to search, navigate and organize web pages in a more effective 
way have been posed in the last decade, such as searching, page rank, web clustering, text 
classification, etc. To this end, there have been a lot of successful stories like Google, 
Yahoo, Open Directory Project (Dmoz), Clusty, just to name but a few. 
Inspired by this trend, the aim of this thesis is to develop efficient systems which 
are able to overcome the difficulties of dealing with sparse data. The main motivation is 
that while being overwhelmed by a huge amount of online data, we sometimes lack data 
to search or learn effectively. Let take web search clustering as an example. In order to 
meet the real-time condition, that is the response time must be short enough, most of 
online clustering systems only work with small pieces of text returned from search 
engines. Unfortunately those pieces are not long and rich enough to build a good 
clustering system. A similar situation occurs in the case of searching images only based 
on captions. Because image captions are only very short and sparse chunks of text, most 
of the current image retrieval systems still fail to achieve high accuracy. As a result, much 
effort has been made recently to take advantage of external resources like learning with 
knowledge-base support, semi-supervised learning, etc. in order to improve the accuracy. 
These approaches, however, have some difficulties: (1) constructing a knowledge base is 
very time-consuming & labor-intensive, and (2) the results of semi-supervised learning in 
one application cannot be reused in another one even in the same domain. 
In the thesis, we introduce two general frameworks for learning with hidden topics 
discovered from large-scale data collections: one for clustering and another for 
classification. Unlike semi-supervised learning, we approach this issue from the point of 
view of text/web data analysis that is based on recently successful topic analysis models, 
such as Latent Semantic Analysis, Probabilistic-Latent Semantic Analysis, or Latent 
Dirichlet Allocation. The underlying idea of the frameworks is that for a domain we 
collect a very large external data collection called “universal dataset”, and then build the 
learner on both the original data (like snippets or image captions) and a rich set of hidden 
topics discovered from the universal data collection. The general frameworks are flexible 
2 
and general enough to apply for a wide range of domains and languages. Once we analyze 
a universal dataset, the resulting hidden topics can be used for several learning tasks in the 
same domain. This is also particularly useful for sparse data mining. Sparse data like 
snippets returned from a search engine can be expanded and enriched with hidden topics. 
Thus, a better performance can be achieved. Moreover, because the method can learn with 
smaller data (the meaningful hidden topics rather than all unlabeled data), it requires less 
computational resources than semi-supervised learning. 
Roadmap: The organization of this thesis is follow 
Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, 
Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models 
can be considered the basic building blocks of general framework of probabilistic 
modeling of text and be used to develop more sophisticated and application-oriented 
models, such as hierarchical models, author-role models, entity models, and so on. They 
can also be considered key components in our proposals in subsequent chapters. 
Chapter 2 introduces two general frameworks for learning with hidden topics: one for 
classification and one for clustering. These frameworks are flexible and general enough to 
apply in many domains of applications. The key common phrase between the two 
frameworks is topic analysis for large-scale collections of web documents. The quality of 
the hidden topic described in this chapter will much influence the performance of 
subsequent stages. 
Chapter 3 summarizes some major issues for analyzing data collections of Vietnamese 
documents/Web pages. We first review some characteristics of Vietnamese which are 
considered significant for data preprocessing and transformation in the subsequent 
processes. Next, we discuss more details about each step of preprocessing and 
transforming data. Important notes, including specific characteristics of Vietnamese are 
highlighted. Also, we demonstrate the results from topic analysis using LDA for the 
clean, preprocessed dataset. 
Chapter 4 describes the deployments of general frameworks proposed in Chapter 2 for 2 
tasks: search result classification, and search result clustering. The two implementations 
are based on the topic model analyzed from a universal dataset like shown in chapter 3. 
The Conclusion sums up the achievements throughout the previous four chapters. Some 
future research topics are also mentioned in this section. 
3 
Chapter 1. The Problem of Modeling Text Corpora and Hidden 
Topic Analysis 
1.1. Introduction 
The goal of modeling text corpora and other collections of discrete data is to find short 
description of the members of a collection that enable efficient processing of large 
collections while preserving the essential statistical relationships that are useful for basis 
tasks such as classification, clustering, summarization, and similarity and relevance 
judgments. 
Significant achievements have been made on this problem by researchers in the context of 
information retrieval (IR). Vector space model [48] (Salton and McGill, 1983) – a 
methodology successfully deployed in modern search technologies - is a typical approach 
proposed by IR researchers for modeling text corpora. In this model, documents are 
represented as vectors in a multidimensional Euclidean space. Each axis in this space 
corresponds to a term (or word). The i-th coordinate of a vector represents some functions 
of times of the i-th term occurs in the document represented by the vector. The end result 
is a term-by-document matrix X whose columns contain the coordinates for each of the 
documents in the corpus. Thus, this model reduces documents of arbitrary length to fixed-
length lists of numbers. 
While the vector space model has some appealing features – notably in its basis 
identification of sets of words that are discriminative for documents in the collection – the 
approach also provides a relatively small amount of reduction in description length and 
reveals little in the way of inter- or intra- document statistical structure. To overcome 
these shortcomings, IR researchers have proposed some other modeling methods such as 
generalized vector space model, topic-based vector space model, etc., among which latent 
semantic analysis (LSA - Deerwester et al, 1990)[13][26] is the most notably. LSA uses a 
singular value decomposition of the term-by-document X matrix to identify a linear 
subspace in the space of term weight features that captures most of variance in the 
collection. This approach can achieve considerable reduction in large collections. 
Furthermore, Deerwester et al argue that this method can reveal some aspects of basic 
linguistic notions such as synonymy or polysemy. 
In 1998, Papadimitriou et al [40] developed a generative probabilistic model of text 
corpora to study the ability of recovering aspects of the generative model from data in 
LSA approach. However, once we have a generative model in hand, it is not clear why we 
4 
should follow the LSI approach – we can attempt to proceed more directly, fitting the 
model to data using maximum likelihood or Bayesian methods. 
The probabilistic LSI (PLSI - Hoffman, 1999) [21] [22] is a significant step in this regard. 
The pLSI models each word in a document as a sample from a mixture model, where each 
mixture components are multinomial random variables that can be viewed as 
representation of “topics”. Consequently, each word is generated from a single topic, and 
different words in a document may be generated from different topics. Each document is 
represented as a probability distribution over a fixed set of topics. This distribution can be 
considered as a “reduced description” associated with the document. 
While Hofmann’s work is a useful step toward probabilistic text modeling, it suffers from 
severe overfitting problems. The number of parameters grows linearly with the number of 
documents. Additionally, although pLSA is a generative model of the documents in the 
collection it is estimated on, it is not a generative model of new documents. Latent 
Dirichlet Allocation (LDA) [5][20] proposed by Blei et. al. (2003) is one solution to these 
problems. Like all of the above methods, LDA bases on the “bag of word” assumption – 
that the order of words in a document can be neglected. In addition, although less often 
stated formally, these methods also assume that documents are exchangeable; the specific 
ordering of the documents in a corpus can also be omitted. According to de Finetti (1990), 
any collection of exchangeable random variables can be represented as a mixture 
distribution – in general an infinite mixture. Thus, if we wish to consider exchangeable 
representations for documents and words, we need to consider mixture models that 
capture the exchangeability of both words and documents. This is the key idea of LDA 
model that we will consider carefully in the section 1.3. 
In recent time, Blei et al have developed the two extensions to LDA. They are Dynamic 
Topic Models (DTM - 2006)[7] and Correlated Topic Models (CTM - 2007) [8]. DTM is 
suitable for time series data analysis thanks to the non-exchangeability nature of modeling 
documents. On the other hand, CTM is capable of revealing topic correlation, for 
example, a document about genetics is more likely to also be about disease than X-ray 
astronomy. Though the CTM gives a better fit of the data in comparison to LDA, it is so 
complicated by the fact that it loses the conjugate relationship between prior distribution 
and likelihood. 
In the following sections, we will discuss more about the issues behind these modeling 
methods with particular attention to LDA – a well-known model that has shown its 
efficiency and success in many applications. 
5 
1.2. The Early Methods 
1.2.1. Latent Semantic Analysis 
The main challenge of machine learning systems is to determine the distinction between 
the lexical level of “what actually has been said or written” and the semantic level of 
“what is intended” or “what was referred to” in the text or utterance. This problem lies in 
twofold: (i) polysemy, i.e., a word has multiple meaning and multiple types of usage in 
different context, and (ii), synonymy and semantically related words, i.e, different words 
mat have similar sense. They at least in certain context specify the same concept or the 
same topic in a weaker sense. 
Latent semantic analysis (LSA - Deerwester et al, 1990) [13][24][26] is the well-known 
technique which partially addresses this problem. The key idea is to map from the 
document vectors in word space to a lower dimensional representation in the so-called 
concept space or latent semantic space. Mathematically, LSA relies on singular value 
decomposition (SVD), a well-known factorization method in linear algebra. 
a. Latent Semantic Analysis by SVD 
In the first step, we present the text corpus as term-by-document matrix where elements 
(i, j) describes the occurrences of term i in document j. Let X be such a matrix, X will look 
like this: 
⎥
⎥
⎥
⎦
⎤
⎢
⎢
⎢
⎣
⎡
→
↓
nmm
n
xx
xx
,1,
,11,1
T
i
j
t
d
L
MOM
L
 Now a row in this matrix is a vector corresponding to a term, giving its relation to each 
document: 
[
]
nii
xx
,1,
T
j
 t L= 
 Likewise, a column in this matrix will be a vector corresponding to a document, giving 
its relation to each term: 
[
]
jmj
T
j
xxd
,,1
 L= 
Now, the dot product between two term vectors gives us the correlation between the 
terms over the documents. The matrix product 
pi
tt
T
T
XX
contains all these dot products. 
6 
Element (i, p) (which equal to element (p,i) due to the symmetry) contains the dot product 
. Similarly, the matrix 
)tt(tt
ipp
T
i
T
=
X
X
T
contains the dot products between all the 
document vectors, giving their correlation over the terms: 
j
T
qq
T
j
dddd =
In the next step, we conduct the standard SVD for the 
X
matrix and get , where 
U and V are orthogonal matrices and the diagonal matrix 
Σ contains the 
singular values of X. The matrix products giving us the term and document correlations 
are then become and respectively. 
T
VUX Σ=
IVVUU
TT
==
TTT
VUXX ΣΣ=
TTT
UVXX ΣΣ=
 Since and are diagonal we see that 
U
must contain the eigenvectors 
of
T
ΣΣ ΣΣ
T
T
XX
, while 
V
must be the eigenvectors of
X
X
T
. Both products have the same non-zero 
eigenvalues, given by the non-zero entries of
T
Σ
Σ , or equally, the non-zero entries of
Σ
Σ
T
. 
Now the decomposition looks like this:  
The values 
l
σ
σ
, ,
1
 are called the singular values, and and the left and 
right singular vectors. Note that only part of 
U
, which contributes to , is the i-th row. 
Let this row vector be called . Likewise, the only part of that contributes to is the 
j’th column, . These are not the eigenvectors, but depend on all the eigenvectors. 
l
uu , ,
1 l
vv , ,
1
i
t
i
t
ˆ
T
V
j
d
j
d
ˆ
The LSA approximation of X is computed by selecting k largest singular values, and their 
corresponding singular vectors from U and V. This results in the rank k approximation to 
X with the smallest error. The appealing thing in this approximation is that not only does 
it have the minimal error, but it translates the terms and document vectors into a concept 
space. The vector then has k entries, each gives the occurrence of term i in one of the k 
concepts. Similarly, the vector gives the relation between document j and each concept. 
We write this approximation as . Based on this approximation, we can now 
do the following: 
i
t
ˆ
j
d
ˆ
T
kkkk
VUX Σ=
- See how related documents j and q are in the concept space by comparing the 
vectors 
j
d
ˆ
 and 
q
d
ˆ
(usually by cosine similarity). This gives us a clustering of the 
documents. 
7 
- Comparing terms i and p by comparing the vectors 
i
t
ˆ
and
n
t
ˆ
, giving us a clustering 
of the terms in the concept space. 
- Given a query, view this as a mini document, and compare it to your documents in 
the concept space. 
To do the latter, we must first translate your query into the concept space with the same 
transformation used on the documents, i.e. and . This means 
that if we have a query vector, we must do the translation before comparing it 
to the document vectors in the concept space. 
jkkj
dUd
ˆ
Σ=
j
T
kkj
dUd
1
ˆ
−
Σ=
qUq
T
kk
1
ˆ
−
Σ=
b. Applications 
The new concept space typically can be used to: 
- Compare the documents in the latent semantic space. This is useful to some typical 
learning tasks such as data clustering or document classification. 
- Find similar documents across languages, after analyzing a base set of translated 
documents. 
- Find relations between terms (synonymy and polysemy). Synonymy and polysemy 
are fundamental problems in natural language processing: 
o Synonymy is the phenomenon where different words describe the same 
idea. Thus, a query in a search engine may fail to retrieve a relevant 
document that does not contain the words which appeared in the query. 
o Polysemy is the phenomenon where the same word has multiple meanings. 
So a search may retrieve irrelevant documents containing the desired words 
in the wrong meaning. For example, a botanist and a computer scientist 
looking for the word "tree" probably desire different sets of documents. 
- Given a query of terms, we could translate it into the concept space, and find 
matching documents (information retrieval). 
c. Limitations 
LSA has two drawbacks: 
- The resulting dimensions might be difficult to interpret. For instance, in 
{(car), (truck), (flower)} > {(1.3452 * car + 0.2828 * truck), (flower)} 
the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". 
However, it is very likely that cases close to 
8 
}
}
{(car), (bottle), (flower)} > {(1.3452 * car + 0.2828 * bottle), (flower)} 
will occur. This leads to results which can be justified on the mathematical level, 
but have no interpretable meaning in natural language. 
- The probabilistic model of LSA does not match observed data: LSA assumes that 
words and documents form a joint Gaussian model (ergodic hypothesis), while a 
Poisson distribution has been observed. Thus, a newer alternative is probabilistic 
latent semantic analysis, based on a multinomial model, which is reported to give 
better results than standard LSA. 
1.2.2. Probabilistic Latent Semantic Analysis 
Probabilistic Latent Semantic Analysis [21][22] (PLSA) is a statistical technique for 
analysis of two-mode and co-occurrence data which has applications in information 
retrieval and filtering, natural language processing, machine learning from text and in 
related areas. Compared to standard LSA, PLSA is based on a mixture decomposition 
derived from a latent class model. This results in a more principled approach which has a 
solid foundation in statistics. 
a. The Aspect Model 
Suppose that we have given a collection of text documents with terms 
from a vocabulary . The starting point for PLSA is a statistical model 
namely aspect model. The aspect model is a latent variable model for co-occurrence data 
in which an unobserved variable 
{
N
ddD , ,
1
=
{
M
wwW , ,
1
=
{
}
K
zzZz , ,
1
=
∈
 is introduced to capture the hidden 
topics implied in the documents. Here, N, M and K are the number of documents, words, 
and topics respectively. Hence, we model the joint probability over by the mixture 
as follows: 
DxW
∑
∈
==
Zz
dzPzwPdwPdwPdPwdP )|()|()|( ),|()(),(
 (1.1) 
Like virtually all statistical latent variable models the aspect model relies on a conditional 
independence assumption, i.e. d and w are independent conditioned on the state of the 
associated latent variable (the graphical model representing this is demonstrated in Figure 
1.1(a)) 
9  
Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) 
parameterization. ( [53]) 
It is necessary to note that the aspect model can be equivalently parameterized by (cf. 
Figure 1.1 (b)) 
∑
∈
=
Zz
zwPzdPzPwdP )|()|()(),(
 (1.2) 
This is perfectly symmetric with respect to both documents and words. 
b. Model Fitting with the Expectation Maximization Algorithm 
The aspect model is estimated by the traditional procedure for maximum likelihood 
estimation, i.e. Expectation Maximization. EM iterates two coupled steps: (i) an 
expectation (E) step in which posterior probabilities are computed for the latent variables; 
and (ii) a maximization (M) step where parameters are updated. Standard calculations 
give us the E-step formulae 
∑
∈
′
′′′
=
Zz
zwPzdPzP
zwPzdPzP
wdzP
)|()|()(
)|()|()(
),|(
 (1.3) 
As well as the following M-step equation 
∑
∈
∝
Dd
wdzPwdnzwP ),|(),( )|(
 (1.4) 
∑
∈
∝
Ww
wdzPwdnzdP ),|(),( )|(
 (1.5) 
∑∑
∈∈
∝
DdWw
wdzPwdnzP ),|(),( )(
 (1.6) 
c. Probabilistic Latent Semantic Space 
10 
Let us consider topic-conditional multinomial distribution over vocabulary as 
points on the 
)|(. zp
1−
M
 dimensional simplex of all possible multinomial. Via convex hull, the 
K points define a 
1−≤
K
L
 dimensional sub-simplex. The modeling assumption 
expressedby (1.1) is that conditional distributions for all documents are 
approximated by a multinomial representable as a convex combination of in 
which the mixture component uniquely define a point on the spanned sub-simplex 
which can identified with a concept space. A simple illustration of this idea is shown in 
)|( dwP
)|( zwP
)|( dzP
Figure 1.2.  
Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [53]) 
In order to clarify the relation to LSA, it is useful to reformulate the aspect model as 
parameterized by (1.2) in matrix notation. By defining , 
 and 
(
ki
ki
zdPU
,,
)|(
ˆ
 =
)
()()
kj
kj
zwPV
,
|
ˆ
=
(
)(
k
k
zPdiag=Σ
ˆ
)
 matrices, we can write the joint probability model 
P
as a matrix product . Comparing this with SVD, we can draw the following 
observations: (i) outer products between rows of 
U
and reflect conditional 
independence in PLSA, (ii) the mixture proportions in PLSA substitute the singular 
values. Nevertheless, the main difference between PLSA and LSA lies on the objective 
function used to specify the optimal approximation. While LSA uses or Frobenius 
norm which corresponds to an implicit additive Gaussian noise assumption on counts, 
PLSA relies on the likelihood function of multinomial sampling and aims at an explicit 
maximization of the predictive power of the model. As is well known, this corresponds to 
a minimization of the cross entropy or Kullback - Leibler divergence between empirical 
distribution and the model, which is very different from the view of any types of squared 
deviation. On the modeling side, this offers crucial advantages, for example, the mixture 
approximation 
T
VUP
ˆ
ˆ
ˆ
Σ=
ˆ
V
ˆ
2
L
P
of the term-by-document matrix is a well-defined probability 
11 
distribution. IN contrast, LSA does not define a properly normalized probability 
distribution and the approximation of term-by-document matrix may contain negative 
entries. In addition, there is no obvious interpretation of the directions in the LSA latent 
space, while the directions in the PLSA space are interpretable as multinomial word 
distributions. The probabilistic approach can also take advantage of the well-established 
statistical theory for model selection and complexity control, e.g., to determine the 
optimal number of latent space dimensions. Choosing the number of dimensions in LSA 
on the other hand is typically based on ad hoc heuristics. 
d. Limitations 
In the aspect model, notice that is a dummy index into the list of documents in the 
training set. Consequently, d is a multinomial random variable with as many possible 
values as there are training documents and the model learns the topic mixtures 
only for those documents on which it is trained. For this reason, pLSI is not a well-
defined generative model of documents; there is no natural way to assign probability to a 
previously unseen document. 
d
)|( dzp
A further difficulty with pLSA, which also originate from the use of a distribution 
indexed by training documents, is that the numbers of parameters grows linearly with the 
number of training documents. The parameters for a K-topic pLSI model are K 
multinomial distributions of size V and M mixtures over the K hidden topics. This gives 
KV + KM parameters and therefore linear growth in M. The linear growth in parameters 
suggests that the model is prone to overfitting and, empirically, overfitting is indeed a 
serious problem. In practice, a tempering heuristic is used to smooth the parameters of the 
model for acceptable predictive performance. It has been shown, however, that overfitting 
can occur even when tempering is used (Popescul et al., 2001, [41]). 
Latent Dirichlet Allocation (LDA - which is described in section 1.3. overcomes both of 
these problems by treating the topic mixture weights as a K-parameter hidden random 
variable rather than a large set of individual parameters which are explicitly linked to the 
training set. 
1.3. Latent Dirichlet Allocation 
Latent Dirichlet Allocation (LDA) [7][20] is a generative probabilistic model for 
collections of discrete data such as text corpora. It was developed by David Blei, Andrew 
Ng, and Michael Jordan in 2003. By nature, LDA is a three-level hierarchical Bayesian 
model in which each item of a collection is modeled as a finite mixture over an 
underlying set of topics. Each topic, in turn, modeled as an infinite mixture over an 
12 
underlying set of topic probabilities. In the context of text modeling, the topic 
probabilities provide an explicit representation of a document. In the following sections, 
we will discuss more about generative model, parameter estimation as well as inference in 
LDA. 
1.3.1. Generative Model in LDA 
Given a corpus of M documents denoted by
{
}
M
dddD , ,,
21
=
, in which each document 
number m in the corpus consists of N
m
 words drawn from a vocabulary of terms 
, the goal of LDA is to find the latent structure of “topics” or “concepts” which 
captured the meaning of text that is imagined to be obscured by “word choice” noise. 
Though the terminology of “hidden topics” or “latent concepts” has been encountered in 
LSA and pLSA, LDA provides us a complete generative model that has shown better 
results than the earlier approaches. 
i
w
{
V
tt , ,
1
}
Consider the graphical model representation of LDA as shown in Figure 1.3, the 
generative process can be interpreted as follows: LDA generates a stream of observable 
words , partitioned into documents
nm
w
,
m
d
r
. For each of these documents, a topic 
proportion 
m
ϑ
r
is drawn, and from this, topic-specific words are emitted. That is, for each 
word, a topic indicator is sampled according to the document – specific mixture 
proportion, and then the corresponding topic-specific term distribution 
nm
z
,
nm
z
,
ϕ
r
used to draw a 
word. The topics 
k
ϕ
r
are sampled once for the entire corpus. The complete (annotated) 
generative model is presented in Figure 1.4. Figure 1.5 gives a list of all involved 
quantities.  
Figure 1.3. Graphical model representation of LDA - The boxes is “plates” representing replicates. The outer 
plate represents documents, while the inner plate represents the repeated choice of topics and words within a 
document [20] 
13   
Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult stand for Dirichlet, 
Poisson, Multinomial distributions respectively.  
Figure 1.5. Quantities in the model of latent Dirichlet allocation 
1.3.2. Likelihood 
According to the model, the probability that a word instantiates a particular term t 
given the LDA parameters is: 
nm
w
,
∑
=
===Φ=
K
k
mnmknmmnm
kzptwptwp
1
,,,
)|()|(),|(
ϑϕϑ
r
r
r
 (1.7) 
which corresponds to one iteration on the word plate of the graphical model. From the 
topology of the graphical model, we can further specify the complete-data likelihood of a 
14 
document, i.e., the joint distribution of all known and hidden variables given the hyper 
parameters: 
43421
r
44444448444444476
r
r
4444434444421
rr
r
r
r
r
plate topic
document) (1 platedocument 
plate word
,
1
,
)|(.),(.)|()|(),|,,,(
,
βαϑϑϕβαϑ
Φ=Φ
∏
=
ppzpwpzdp
mmnmz
N
n
nmmmm
nm
m
 (1.8) 
Specifying this distribution is often simple and useful as a basic for other derivations. So 
we can obtain the likelihood of a document
m
d
r
, i.e., of the joint event of all word 
occurrences, as one of its marginal distributions by integrating out the distributions 
m
ϑ
r
and 
Φ
and summing over : 
nm
z
,
()
∫∫
∏
∑
=
ΦΦ=
m
nm
nm
N
n
z
mmnmznmmm
ddzpwpppdp
1
,,
,
,
|)|().|().|(),|(
ϑϑϕβαϑβα
rr
r
r
r
rr
r
r
 (1.9) 
()
∏
∫∫
=
ΦΦΦ=
m
N
n
mmnmm
ddwppp
1
,
),|().|(.|
ϑϑβαϑ
rrr
r
r
 (1.10) 
Finally, the likelihood of the complete corpus 
{
}
M
m
m
dW
1=
=
r
 is determined by the product of 
the likelihoods of the independent documents: 
∏
=
=
M
m
m
dpWp
1
).,|(),|(
βαβα
r
r
r
r
r
 (1.11) 
1.3.3. Parameter Estimation and Inference via Gibbs Sampling 
Exact estimation for LDA is generally intractable. The common solution to this is to use 
approximate inference algorithms such as mean-field variational expectation 
maximization, expectation propagation, and Gibbs sampling [20] 
a. Gibbs Sampling 
Gibbs sampling is a special case of Markov-chain Monte Carlo (MCMC) and often yields 
relatively simple algorithms for approximate inference in high-dimensional models such 
as LDA. Through the stationary behavior of a Markov chain, MCMC methods can 
emulate high-dimensional probability distributions
)(xp
r
. This means that one sample is 
generated for each transition in the chain after a stationary state of the chain has been 
reached, which happens after a so-called “burn-in period” that eliminates the influence of 
initialization parameters. In Gibbs sampling, the dimensions of the distribution are 
i
x
15 
r
sampled alternately one at a time, conditioned on the values of all other dimensions, 
which we denote . The algorithm works as follows: 
i
x
−
1. Choose dimension i (random or by permutation) 
2. Sample 
i
x from )|(
ii
xx p
−
r
 To build a Gibbs sampler, the full conditionals 
)|(
ii
xxp
−
r
 must be calculated, which is 
possible using 
(
)
()
{
ii
i
ii
xxx
dxxp
xp
xxp
−−
==
∫
rr
r
}
r
r
, with )|(
 (1.12) 
For models that contain hidden variables
z
r
, their posterior given the evidence,
(
)
xzp
r
r
|
, is a 
distribution commonly wanted. With Eq. 1.12, the general formulation of a Gibbs sampler 
for such latent-variable models becomes: 
()
(
)
()
∫
=
−
Z
i
ii
dzxzp
xzp
xzzp
r
r
r
r
r
r
,
,
,|
 (1.13) 
where the integral changes to a sum for discrete variables. With a sufficient number of 
samples
[
Rrz
r
,1,
]
~
∈
r
, the latent-variable posterior can be approximated using: 
()
(
)
∑
=
−≈
R
r
r
zz
R
xzp
1
~
1
|
rr
r
r
δ
 (1.14) 
With the Kronecker delta 
()
{
}
otherwise 0 ;0 if 1
=
=
uu
r
r
δ 
b. Parameter Estimation 
Heirich et al [20] has applied the above hidden-variable method to develop a Gibbs 
sampler for LDA as shown in Figure 1.6. The hidden variables here are , i.e., the 
topics that appear with the words of the corpus . We do not need to include the 
parameter sets 
nm
z
,
nm
w
,
Θ
and 
Φ
because they are just the statistics of the associations between the 
observed and the corresponding , the state variables of the Markov chain. 
nm
w
,
nm
z
,
Heirich [20] has been shown a sequence of calculations to lead to the formulation of the 
full conditionals for LDA as follows: 
()
(
)
(
)
()
11
,|
1
,
1
,
−
⎥
⎦
⎤
⎢
⎣
⎡
+
+
−
⎥
⎦
⎤
⎢
⎣
⎡
+
+
=
∑
∑
=
−
=
−
−
z
K
z
z
m
z
z
im
v
V
v
v
z
t
t
iz
ii
n
n
n
n
wzzp
α
α
β
β
r
r
 (1.15) 
16 
The other hidden variables of LDA can be calculated as follows: 
(
)
()
v
V
v
v
k
t
t
k
tk
n
n
β
β
ϕ
+
+
=
∑
=1
,
 (1.16) 
(
)
()
z
K
z
z
m
k
k
m
km
n
n
α
α
ϑ
+
+
=
∑
=1
,
 (1.17) 
- Initialization 
zero all count variables, , ,
()
z
m
n
m
n
(
)
t
z
n , 
z
n
for all documents do 
[]
Mm ,1∈ 
for all words in document do 
[
m
Nn ,1∈
]
m
 sample topic index ~Mult(1/
K) 
nm
z
,
 increment document-topic count: 
(
)
1+
s
m
n 
 increment document-topic sum: 1
+
m
n 
 increment topic-term count: 
(
)
1+
t
s
n 
 increment topic-term sum: 
1
+
z
n  
end for 
end for 
]
- Gibbs sampling over burn-in period and sampling period 
while not finished do  
for all documents do 
[]
Mm ,1∈
 for all words in document do 
[
m
Nn ,1∈
m
 - for the current assignment of to a term t for word : 
z
nm
w
,
 decrement counts and sums: 
(
)
1−
z
m
n
;1
−
m
n ;
(
)
1−
t
z
n ;
1
−
z
n 
 - multinomial sampling acc. To Eq. 1.15 (decrements from previous step): 
 sample topic index 
(
)
wzzpz
ii
r
r
,|~
~
− 
 - use the new assignment of to the term 
t for word to: z
nm
w
,
 increment counts and sums: 
(
)
1+
z
m
n
r
; ;1+
t
z
n
r
1
+
z
n
r  
end for 
 end for 
- check convergence and read out parameters 
 if converged and L sampling iterations since last read out then 
 - the different parameters read outs are averaged 
 read out parameter set 
Φ
acc. to Eq. 1.16 
 read out parameter set 
Θ
acc. to Eq. 1.17  
end if 
end while 
Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation.