My deepest thank must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who offers me an endless inspiration in scientific research, leading me to this research area. I particularly appreciate his unconditional support and advice in both academic environment and daily life during the last four years.
67 trang |
Chia sẻ: vietpd | Lượt xem: 1336 | Lượt tải: 1
Bạn đang xem trước 20 trang tài liệu Luận văn Hidden topic discovery toward classification and clustering in vietnamese web documents, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
VIET NAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY
NGUYEN CAM TU
HIDDEN TOPIC DISCOVERY TOWARD
CLASSIFICATION AND CLUSTERING IN
VIETNAMESE WEB DOCUMENTS
MASTER THESIS
HANOI - 2008
VIET NAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY
NGUYEN CAM TU
HIDDEN TOPIC DISCOVERY TOWARD
CLASSIFICATION AND CLUSTERING IN
VIETNAMESE WEB DOCUMENTS
Major: Information Technology
Specificity: Information Systems
Code: 60 48 05
MASTER THESIS
SUPERVISOR: Prof. Dr. Ha Quang Thuy
HANOI - 2008
i
Acknowledgements
My deepest thank must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who
offers me an endless inspiration in scientific research, leading me to this research area. I
particularly appreciate his unconditional support and advice in both academic
environment and daily life during the last four years.
Many thanks go to Dr. Phan Xuan Hieu who has given me many advices and comments.
This work can not be possible without his support. Also, I would like to thank him for
being my friend, my older brother who has brought me a lot of lessons in both scientific
research and daily life.
My thanks also go to all members of seminar group “data mining”. Especially, I would
like to thank Bsc. Nguyen Thu Trang for helping me a lot in collecting data and doing
experiments.
I highly acknowledge the invaluable support and advice in both technical and daily life of
my teachers, my colleagues in Department of Information Systems, Faculty of
Technology, Vietnam National University, Hanoi
I also want to thank the supports from the Project QC.06.07 “Vietnamese Named Entity
Resolution and Tracking crossover Web Documents”, Vietnam National University,
Hanoi; the Project 203906 “`Information Extraction Models for finding Entities and
Semantic Relations in Vietnamese Web Pages'' of the Ministry of Science and
Technology, Vietnam; and the National Project 02/2006/HĐ - ĐTCT-KC.01/06-10
“Developing content filter systems to support management and implementation public
security – ensure policy”
Finally, from bottom of my heart, I would specially like to say thanks to all members in
my family, all my friends. They are really an endless encouragement in my life.
Nguyen Cam Tu
ii
Assurance
I certify that the achievements in this thesis belong to my personal, and are not copied
from any other’s results. Throughout the dissertation, all the mentions are either my
proposal, or summarized from many sources. All the references have clear origins, and
properly quoted. I am responsible for this statement.
Hanoi, November 15, 2007
Nguyen Cam Tu
iii
Table of Content
Introduction ..........................................................................................................................1
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis ...............3
1.1. Introduction ...............................................................................................................3
1.2. The Early Methods ....................................................................................................5
1.2.1. Latent Semantic Analysis ...................................................................................5
1.2.2. Probabilistic Latent Semantic Analysis..............................................................8
1.3. Latent Dirichlet Allocation......................................................................................11
1.3.1. Generative Model in LDA................................................................................12
1.3.2. Likelihood.........................................................................................................13
1.3.3. Parameter Estimation and Inference via Gibbs Sampling................................14
1.3.4. Applications......................................................................................................17
1.4. Summary..................................................................................................................17
Chapter 2. Frameworks of Learning with Hidden Topics..................................................19
2.1. Learning with External Resources: Related Works ................................................19
2.2. General Learning Frameworks ................................................................................20
2.2.1. Frameworks for Learning with Hidden Topics ................................................20
2.2.2. Large-Scale Web Collections as Universal Dataset .........................................22
2.3. Advantages of the Frameworks ...............................................................................23
2.4. Summary..................................................................................................................23
Chapter 3. Topics Analysis of Large-Scale Web Dataset ..................................................24
3.1. Some Characteristics of Vietnamese .......................................................................24
3.1.1. Sound................................................................................................................24
3.1.2. Syllable Structure .............................................................................................26
3.1.3. Vietnamese Word .............................................................................................26
3.2. Preprocessing and Transformation..........................................................................27
3.2.1. Sentence Segmentation.....................................................................................27
iv
3.2.2. Sentence Tokenization......................................................................................28
3.2.3. Word Segmentation ..........................................................................................28
3.2.4. Filters ................................................................................................................28
3.2.5. Remove Non Topic-Oriented Words ...............................................................28
3.3. Topic Analysis for VnExpress Dataset ...................................................................29
3.4. Topic Analysis for Vietnamese Wikipedia Dataset ................................................30
3.5. Discussion................................................................................................................31
3.6. Summary..................................................................................................................32
Chapter 4. Deployments of General Frameworks ..............................................................33
4.1. Classification with Hidden Topics ..........................................................................33
4.1.1. Classification Method.......................................................................................33
4.1.2. Experiments ......................................................................................................36
4.2. Clustering with Hidden Topics................................................................................40
4.2.1. Clustering Method ............................................................................................40
4.2.2. Experiments ......................................................................................................45
4.3. Summary..................................................................................................................49
Conclusion ..........................................................................................................................50
Achievements throughout the thesis...............................................................................50
Future Works ..................................................................................................................50
References ..........................................................................................................................52
Vietnamese References ..................................................................................................52
English References .........................................................................................................52
Appendix: Some Clustering Results...................................................................................56
v
List of Figures
Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and
symmetric (b) parameterization. ( [55]) ...............................................................................9
Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [55])...10
Figure 1.3. Graphical model representation of LDA - The boxes is “plates” representing
replicates. The outer plate represents documents, while the inner plate represents the
repeated choice of topics and words within a document [20] ............................................12
Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult
stand for Dirichlet, Poisson, Multinomial distributions respectively.................................13
Figure 1.5. Quantities in the model of latent Dirichlet allocation......................................13
Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation..............................16
Figure 2.1. Classification with Hidden Topics...................................................................20
Figure 2.2. Clustering with Hidden Topics ........................................................................21
Figure 3.1. Pipeline of Data Preprocessing and Transformation .......................................27
Figure 4.1. Classification with VnExpress topics .............................................................33
Figure 4.2 Combination of one snippet with its topics: an example ..................................35
Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline
(without topics)...................................................................................................................37
Figure 4.4. Test-out-of train with increasing numbers of training examples. Here, the
number of topics is set at 60topics .....................................................................................37
Figure 4.5 F1-Measure for classes and average (over all classes) in learning with 60
topics...................................................................................................................................39
Figure 4.6. Clustering with Hidden Topics ........................................................................40
Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering..................................42
Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query.........................47
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query ...........................47
vi
List of Tables
Table 3.1. Vowels in Vietnamese.......................................................................................24
Table 3.2. Tones in Vietnamese .........................................................................................25
Table 3.3. Consonants of hanoi variety ..............................................................................26
Table 3.4. Structure of Vietnamese syllables ....................................................................26
Table 3.5. Functional words in Vietnamese .......................................................................29
Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset.........................29
Table 3.7. Statistics of VnExpress dataset .........................................................................30
Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100
topics...................................................................................................................................30
Table 3.9. Statistic of Vietnamese Wikipedia Dataset ......................................................31
Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200
topics...................................................................................................................................31
Table 4.1 Google search results as training and testing dataset. The search phrases for
training and test data are designed to be exclusive ............................................................34
Table 4.2. Experimental results of baseline (learning without topics)...............................38
Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset.............38
Table 4.4. Some collocations with highest values of chi-square statistic ..........................44
Table 4.5. Queries submitted to Google.............................................................................45
Table 4.6. Parameters for clustering web search results ....................................................46
vii
Notations & Abbreviations
Word or phrase Abbreviation
Information Retrieval IR
Latent Semantic Analysis LSA
Probability Latent Semantic Analysis PLSA
Latent Dirichlet Allocation LDA
Dynamic Topic Models DTM
Correlated Topic Models CTM
Singular Value Decomposition SVD
1
Introduction
The World Wide Web has influenced many aspects of our lives, changing the way we
communicate, conduct business, shop, entertain, and so on. However, a large portion of
the Web data is not organized in systematic and well structured forms, a situation which
causes great challenges to those seeking for information on the Web. Consequently, a lot
of tasks enabling users to search, navigate and organize web pages in a more effective
way have been posed in the last decade, such as searching, page rank, web clustering, text
classification, etc. To this end, there have been a lot of successful stories like Google,
Yahoo, Open Directory Project (Dmoz), Clusty, just to name but a few.
Inspired by this trend, the aim of this thesis is to develop efficient systems which
are able to overcome the difficulties of dealing with sparse data. The main motivation is
that while being overwhelmed by a huge amount of online data, we sometimes lack data
to search or learn effectively. Let take web search clustering as an example. In order to
meet the real-time condition, that is the response time must be short enough, most of
online clustering systems only work with small pieces of text returned from search
engines. Unfortunately those pieces are not long and rich enough to build a good
clustering system. A similar situation occurs in the case of searching images only based
on captions. Because image captions are only very short and sparse chunks of text, most
of the current image retrieval systems still fail to achieve high accuracy. As a result, much
effort has been made recently to take advantage of external resources like learning with
knowledge-base support, semi-supervised learning, etc. in order to improve the accuracy.
These approaches, however, have some difficulties: (1) constructing a knowledge base is
very time-consuming & labor-intensive, and (2) the results of semi-supervised learning in
one application cannot be reused in another one even in the same domain.
In the thesis, we introduce two general frameworks for learning with hidden topics
discovered from large-scale data collections: one for clustering and another for
classification. Unlike semi-supervised learning, we approach this issue from the point of
view of text/web data analysis that is based on recently successful topic analysis models,
such as Latent Semantic Analysis, Probabilistic-Latent Semantic Analysis, or Latent
Dirichlet Allocation. The underlying idea of the frameworks is that for a domain we
collect a very large external data collection called “universal dataset”, and then build the
learner on both the original data (like snippets or image captions) and a rich set of hidden
topics discovered from the universal data collection. The general frameworks are flexible
2
and general enough to apply for a wide range of domains and languages. Once we analyze
a universal dataset, the resulting hidden topics can be used for several learning tasks in the
same domain. This is also particularly useful for sparse data mining. Sparse data like
snippets returned from a search engine can be expanded and enriched with hidden topics.
Thus, a better performance can be achieved. Moreover, because the method can learn with
smaller data (the meaningful hidden topics rather than all unlabeled data), it requires less
computational resources than semi-supervised learning.
Roadmap: The organization of this thesis is follow
Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis,
Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models
can be considered the basic building blocks of general framework of probabilistic
modeling of text and be used to develop more sophisticated and application-oriented
models, such as hierarchical models, author-role models, entity models, and so on. They
can also be considered key components in our proposals in subsequent chapters.
Chapter 2 introduces two general frameworks for learning with hidden topics: one for
classification and one for clustering. These frameworks are flexible and general enough to
apply in many domains of applications. The key common phrase between the two
frameworks is topic analysis for large-scale collections of web documents. The quality of
the hidden topic described in this chapter will much influence the performance of
subsequent stages.
Chapter 3 summarizes some major issues for analyzing data collections of Vietnamese
documents/Web pages. We first review some characteristics of Vietnamese which are
considered significant for data preprocessing and transformation in the subsequent
processes. Next, we discuss more details about each step of preprocessing and
transforming data. Important notes, including specific characteristics of Vietnamese are
highlighted. Also, we demonstrate the results from topic analysis using LDA for the
clean, preprocessed dataset.
Chapter 4 describes the deployments of general frameworks proposed in Chapter 2 for 2
tasks: search result classification, and search result clustering. The two implementations
are based on the topic model analyzed from a universal dataset like shown in chapter 3.
The Conclusion sums up the achievements throughout the previous four chapters. Some
future research topics are also mentioned in this section.
3
Chapter 1. The Problem of Modeling Text Corpora and Hidden
Topic Analysis
1.1. Introduction
The goal of modeling text corpora and other collections of discrete data is to find short
description of the members of a collection that enable efficient processing of large
collections while preserving the essential statistical relationships that are useful for basis
tasks such as classification, clustering, summarization, and similarity and relevance
judgments.
Significant achievements have been made on this problem by researchers in the context of
information retrieval (IR). Vector space model [48] (Salton and McGill, 1983) – a
methodology successfully deployed in modern search technologies - is a typical approach
proposed by IR researchers for modeling text corpora. In thi