 VIET NAM NATIONAL UNIVERSITY, HANOI 
COLLEGE OF TECHNOLOGY 
NGUYEN CAM TU 
HIDDEN TOPIC DISCOVERY TOWARD 
CLASSIFICATION AND CLUSTERING IN 
VIETNAMESE WEB DOCUMENTS 
MASTER THESIS 
HANOI - 2008 
 VIET NAM NATIONAL UNIVERSITY, HANOI 
COLLEGE OF TECHNOLOGY 
NGUYEN CAM TU 
HIDDEN TOPIC DISCOVERY TOWARD 
CLASSIFICATION AND CLUSTERING IN 
VIETNAMESE WEB DOCUMENTS 
Major: Information Technology 
Speciality: Information Systems 
Code: 60 48 05 
MASTER THESIS 
SUPERVISOR: Prof. Dr. Ha Quang Thuy 
HANOI - 2008 
Acknowledgements 
My deepest thanks must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who has been an endless source of inspiration in scientific research and who led me to this research area. I particularly appreciate his unconditional support and advice, in both my academic and daily life, over the last four years.
Many thanks go to Dr. Phan Xuan Hieu, who has given me much advice and many comments. This work would not have been possible without his support. I would also like to thank him for being my friend and older brother, who has taught me many lessons in both scientific research and daily life.
My thanks also go to all members of the "Data Mining" seminar group. In particular, I would like to thank BSc. Nguyen Thu Trang for helping me a great deal in collecting data and running experiments.
I gratefully acknowledge the invaluable support and advice, in both technical matters and daily life, of my teachers and colleagues in the Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi.
I also want to acknowledge the support of the Project QC.06.07 "Vietnamese Named Entity Resolution and Tracking crossover Web Documents", Vietnam National University, Hanoi; the Project 203906 "Information Extraction Models for finding Entities and Semantic Relations in Vietnamese Web Pages" of the Ministry of Science and Technology, Vietnam; and the National Project 02/2006/HĐ-ĐTCT-KC.01/06-10 "Developing content filter systems to support management and implementation public security – ensure policy".
Finally, from the bottom of my heart, I would especially like to thank all the members of my family and all my friends. They are truly an endless source of encouragement in my life.
Nguyen Cam Tu 
Assurance 
I certify that the achievements in this thesis are my own and are not copied from anyone else's work. Throughout the thesis, everything presented is either my own proposal or is summarized from cited sources. All references have clear origins and are properly quoted. I take full responsibility for this statement.
 Hanoi, November 15, 2007 
 Nguyen Cam Tu 
Table of Contents 
Introduction ..........................................................................................................................1 
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis ...............3 
1.1. Introduction ...............................................................................................................3 
1.2. The Early Methods ....................................................................................................5 
1.2.1. Latent Semantic Analysis ...................................................................................5 
1.2.2. Probabilistic Latent Semantic Analysis..............................................................8 
1.3. Latent Dirichlet Allocation......................................................................................11 
1.3.1. Generative Model in LDA................................................................................12 
1.3.2. Likelihood.........................................................................................................13 
1.3.3. Parameter Estimation and Inference via Gibbs Sampling................................14 
1.3.4. Applications......................................................................................................17 
1.4. Summary..................................................................................................................17 
Chapter 2. Frameworks of Learning with Hidden Topics..................................................19 
2.1. Learning with External Resources: Related Works ................................................19 
2.2. General Learning Frameworks ................................................................................20 
2.2.1. Frameworks for Learning with Hidden Topics ................................................20 
2.2.2. Large-Scale Web Collections as Universal Dataset .........................................22 
2.3. Advantages of the Frameworks ...............................................................................23 
2.4. Summary..................................................................................................................23 
Chapter 3. Topics Analysis of Large-Scale Web Dataset ..................................................24 
3.1. Some Characteristics of Vietnamese .......................................................................24 
3.1.1. Sound................................................................................................................24 
3.1.2. Syllable Structure .............................................................................................26 
3.1.3. Vietnamese Word .............................................................................................26 
3.2. Preprocessing and Transformation..........................................................................27 
3.2.1. Sentence Segmentation.....................................................................................27 
3.2.2. Sentence Tokenization......................................................................................28 
3.2.3. Word Segmentation ..........................................................................................28 
3.2.4. Filters ................................................................................................................28 
3.2.5. Remove Non Topic-Oriented Words ...............................................................28 
3.3. Topic Analysis for VnExpress Dataset ...................................................................29 
3.4. Topic Analysis for Vietnamese Wikipedia Dataset ................................................30 
3.5. Discussion................................................................................................................31 
3.6. Summary..................................................................................................................32 
Chapter 4. Deployments of General Frameworks ..............................................................33 
4.1. Classification with Hidden Topics ..........................................................................33 
4.1.1. Classification Method.......................................................................................33 
4.1.2. Experiments ......................................................................................................36 
4.2. Clustering with Hidden Topics................................................................................40 
4.2.1. Clustering Method ............................................................................................40 
4.2.2. Experiments ......................................................................................................45 
4.3. Summary..................................................................................................................49 
Conclusion ..........................................................................................................................50 
Achievements throughout the thesis...............................................................................50 
Future Works ..................................................................................................................50 
References ..........................................................................................................................52 
Vietnamese References ..................................................................................................52 
English References .........................................................................................................52 
Appendix: Some Clustering Results...................................................................................56 
List of Figures 
Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and 
symmetric (b) parameterizations ([55]) ...............................................................9 
Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ([55])...10 
Figure 1.3. Graphical model representation of LDA. The boxes are "plates" representing 
replicates. The outer plate represents documents, while the inner plate represents the 
repeated choice of topics and words within a document [20] ............................................12 
Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult 
stand for Dirichlet, Poisson, Multinomial distributions respectively.................................13 
Figure 1.5. Quantities in the model of latent Dirichlet allocation......................................13 
Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation..............................16 
Figure 2.1. Classification with Hidden Topics...................................................................20 
Figure 2.2. Clustering with Hidden Topics ........................................................................21 
Figure 3.1. Pipeline of Data Preprocessing and Transformation .......................................27 
Figure 4.1. Classification with VnExpress topics .............................................................33 
Figure 4.2 Combination of one snippet with its topics: an example ..................................35 
Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline 
(without topics)...................................................................................................................37 
Figure 4.4. Test-out-of-train with an increasing number of training examples. Here, the 
number of topics is set at 60 .....................................................................37 
Figure 4.5. F1-measure for each class and the average over all classes in learning with 60 
topics...................................................................................39 
Figure 4.6. Clustering with Hidden Topics ........................................................................40 
Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering..................................42 
Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query.........................47 
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query ...........................47 
List of Tables 
Table 3.1. Vowels in Vietnamese.......................................................................................24 
Table 3.2. Tones in Vietnamese .........................................................................................25 
Table 3.3. Consonants of the Hanoi variety ..............................................26 
Table 3.4. Structure of Vietnamese syllables ....................................................................26 
Table 3.5. Functional words in Vietnamese .......................................................................29 
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset.........................29 
Table 3.7. Statistics of VnExpress dataset .........................................................................30 
Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100 
topics...................................................................................................................................30 
Table 3.9. Statistics of the Vietnamese Wikipedia Dataset ......................................31 
Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200 
topics...................................................................................................................................31 
Table 4.1. Google search results as training and testing datasets. The search phrases for 
training and test data are designed to be mutually exclusive ............................................34 
Table 4.2. Experimental results of baseline (learning without topics)...............................38 
Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset.............38 
Table 4.4. Some collocations with the highest values of the chi-square statistic ..........................44 
Table 4.5. Queries submitted to Google.............................................................................45 
Table 4.6. Parameters for clustering web search results ....................................................46 
Notations & Abbreviations 
Word or phrase Abbreviation 
Information Retrieval IR 
Latent Semantic Analysis LSA 
Probabilistic Latent Semantic Analysis PLSA 
Latent Dirichlet Allocation LDA 
Dynamic Topic Models DTM 
Correlated Topic Models CTM 
Singular Value Decomposition SVD 
Introduction 
The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of Web data is not organized in a systematic and well-structured form, a situation which poses great challenges to those seeking information on the Web. Consequently, many tasks enabling users to search, navigate, and organize web pages more effectively have been posed in the last decade, such as searching, page ranking, web clustering, text classification, and so on. To this end, there have been many success stories, such as Google, Yahoo, the Open Directory Project (Dmoz), and Clusty, to name but a few.
Inspired by this trend, the aim of this thesis is to develop efficient systems that are able to overcome the difficulties of dealing with sparse data. The main motivation is that, while being overwhelmed by a huge amount of online data, we sometimes lack the data to search or learn effectively. Take web search clustering as an example. In order to meet the real-time requirement, that is, a sufficiently short response time, most online clustering systems work only with the small pieces of text returned from search engines. Unfortunately, those pieces are not long and rich enough to build a good clustering system. A similar situation occurs when searching images based only on their captions. Because image captions are very short and sparse chunks of text, most current image retrieval systems still fail to achieve high accuracy. As a result, much effort has recently been made to take advantage of external resources, such as learning with knowledge-base support or semi-supervised learning, in order to improve accuracy. These approaches, however, have some drawbacks: (1) constructing a knowledge base is very time-consuming and labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another, even within the same domain.
In this thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and one for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis, based on recently successful topic analysis models such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. The underlying idea of the frameworks is that, for a given domain, we collect a very large external data collection called the "universal dataset" and then build the learner on both the original data (such as snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The frameworks are flexible and general enough to be applied to a wide range of domains and languages. Once we have analyzed a universal dataset, the resulting hidden topics can be reused for several learning tasks in the same domain. This is particularly useful for sparse data mining: sparse data such as the snippets returned from a search engine can be expanded and enriched with hidden topics, so that better performance can be achieved. Moreover, because the method learns from smaller data (the meaningful hidden topics rather than all the unlabeled data), it requires fewer computational resources than semi-supervised learning.
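To make the enrichment step concrete, the following is a minimal sketch in Python using the gensim library. It is an illustration under simplifying assumptions rather than the thesis's actual implementation (which estimates LDA by Gibbs sampling, as described in Chapter 1, over a large Vietnamese web collection); the toy documents and the enrich() helper are hypothetical.

    # Sketch only: train a topic model on a (toy) "universal dataset", then
    # enrich a sparse snippet with its inferred hidden-topic distribution.
    from gensim import corpora, models

    # Toy stand-in for the universal dataset (in the thesis: a large,
    # word-segmented crawl of VnExpress / Vietnamese Wikipedia pages).
    universal_docs = [
        ["gold", "price", "market", "economy"],
        ["football", "match", "goal", "team"],
        ["gold", "economy", "bank", "interest"],
        ["team", "player", "football", "season"],
    ]

    dictionary = corpora.Dictionary(universal_docs)
    corpus = [dictionary.doc2bow(doc) for doc in universal_docs]

    # Estimate the topic model over the universal dataset (gensim uses
    # variational inference; the thesis uses Gibbs sampling).
    lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                          num_topics=2, passes=50, random_state=0)

    def enrich(snippet_tokens):
        """Append the snippet's inferred topic distribution to its surface
        features, giving a sparse snippet a dense, reusable representation."""
        bow = dictionary.doc2bow(snippet_tokens)
        topics = lda.get_document_topics(bow, minimum_probability=0.0)
        return snippet_tokens, [prob for _, prob in topics]

    # A short search snippet, too sparse on its own:
    tokens, topic_features = enrich(["gold", "price"])
    print(topic_features)  # e.g. [0.9, 0.1]: dominated by the "economy" topic

The downstream classifier or clusterer then trains on the combined surface-plus-topic features, and the same trained topic model can be reused across tasks in the domain, which is exactly the reuse property noted above.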
Roadmap: The organization of this thesis is as follows. 
Chapter 1 reviews some typical topic analysis methods, namely Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of the general framework of probabilistic text modeling, and can be used to develop more sophisticated and application-oriented models such as hierarchical models, author-role models, entity models, and so on. They are also key components of our proposals in the subsequent chapters.
Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to be applied in many application domains. The key phase common to the two frameworks is topic analysis of large-scale collections of web documents. The quality of the hidden topics described in this chapter strongly influences the performance of the subsequent stages.
Chapter 3 summarizes the major issues in analyzing collections of Vietnamese documents and Web pages. We first review some characteristics of Vietnamese that are significant for data preprocessing and transformation in the subsequent processes. Next, we discuss each step of data preprocessing and transformation in more detail. Important notes, including specific characteristics of Vietnamese, are highlighted. We also demonstrate the results of topic analysis using LDA on the clean, preprocessed dataset.
Chapter 4 describes deployments of the general frameworks proposed in Chapter 2 for two tasks: search result classification and search result clustering. Both implementations are based on a topic model estimated from a universal dataset, as shown in Chapter 3.
The Conclusion sums up the achievements of the previous four chapters. Some future research topics are also mentioned in this section.
Chapter 1. The Problem of Modeling Text Corpora and Hidden 
Topic Analysis 
1.1. Introduction 
The goal of modeling text corpora and other collections of discrete data is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, clustering, summarization, and similarity and relevance judgments.
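As a concrete grounding for such "short descriptions", the following minimal sketch (assuming the scikit-learn library; the toy documents are hypothetical) shows the term-document representation that the vector space model reviewed next, and the topic models in the rest of this chapter, all build on: each document is reduced to a vector of weighted word counts.

    # Sketch only: documents as TF-IDF-weighted term vectors, a compact
    # description that supports similarity and relevance judgments.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "gold price rises as the market opens",
        "the football team wins the match",
        "bank interest rates follow the gold market",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)    # documents x vocabulary matrix

    print(X.shape)                        # (3, vocabulary size)
    print(cosine_similarity(X[0], X[2]))  # docs 0 and 2 share "gold", "market"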
Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). The vector space model [48] (Salton and McGill, 1983), a methodology successfully deployed in modern search technologies, is a typical approach proposed by IR researchers for modeling text corpora. In thi