Luận văn Hidden topic discovery toward classification and clustering in vietnamese web documents

My deepest thank must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who offers me an endless inspiration in scientific research, leading me to this research area. I particularly appreciate his unconditional support and advice in both academic environment and daily life during the last four years.

pdf67 trang | Chia sẻ: vietpd | Lượt xem: 1336 | Lượt tải: 1download
Bạn đang xem trước 20 trang tài liệu Luận văn Hidden topic discovery toward classification and clustering in vietnamese web documents, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
VIET NAM NATIONAL UNIVERSITY, HANOI COLLEGE OF TECHNOLOGY NGUYEN CAM TU HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS MASTER THESIS HANOI - 2008 VIET NAM NATIONAL UNIVERSITY, HANOI COLLEGE OF TECHNOLOGY NGUYEN CAM TU HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS Major: Information Technology Specificity: Information Systems Code: 60 48 05 MASTER THESIS SUPERVISOR: Prof. Dr. Ha Quang Thuy HANOI - 2008 i Acknowledgements My deepest thank must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who offers me an endless inspiration in scientific research, leading me to this research area. I particularly appreciate his unconditional support and advice in both academic environment and daily life during the last four years. Many thanks go to Dr. Phan Xuan Hieu who has given me many advices and comments. This work can not be possible without his support. Also, I would like to thank him for being my friend, my older brother who has brought me a lot of lessons in both scientific research and daily life. My thanks also go to all members of seminar group “data mining”. Especially, I would like to thank Bsc. Nguyen Thu Trang for helping me a lot in collecting data and doing experiments. I highly acknowledge the invaluable support and advice in both technical and daily life of my teachers, my colleagues in Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi I also want to thank the supports from the Project QC.06.07 “Vietnamese Named Entity Resolution and Tracking crossover Web Documents”, Vietnam National University, Hanoi; the Project 203906 “`Information Extraction Models for finding Entities and Semantic Relations in Vietnamese Web Pages'' of the Ministry of Science and Technology, Vietnam; and the National Project 02/2006/HĐ - ĐTCT-KC.01/06-10 “Developing content filter systems to support management and implementation public security – ensure policy” Finally, from bottom of my heart, I would specially like to say thanks to all members in my family, all my friends. They are really an endless encouragement in my life. Nguyen Cam Tu ii Assurance I certify that the achievements in this thesis belong to my personal, and are not copied from any other’s results. Throughout the dissertation, all the mentions are either my proposal, or summarized from many sources. All the references have clear origins, and properly quoted. I am responsible for this statement. Hanoi, November 15, 2007 Nguyen Cam Tu iii Table of Content Introduction ..........................................................................................................................1 Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis ...............3 1.1. Introduction ...............................................................................................................3 1.2. The Early Methods ....................................................................................................5 1.2.1. Latent Semantic Analysis ...................................................................................5 1.2.2. Probabilistic Latent Semantic Analysis..............................................................8 1.3. Latent Dirichlet Allocation......................................................................................11 1.3.1. Generative Model in LDA................................................................................12 1.3.2. Likelihood.........................................................................................................13 1.3.3. Parameter Estimation and Inference via Gibbs Sampling................................14 1.3.4. Applications......................................................................................................17 1.4. Summary..................................................................................................................17 Chapter 2. Frameworks of Learning with Hidden Topics..................................................19 2.1. Learning with External Resources: Related Works ................................................19 2.2. General Learning Frameworks ................................................................................20 2.2.1. Frameworks for Learning with Hidden Topics ................................................20 2.2.2. Large-Scale Web Collections as Universal Dataset .........................................22 2.3. Advantages of the Frameworks ...............................................................................23 2.4. Summary..................................................................................................................23 Chapter 3. Topics Analysis of Large-Scale Web Dataset ..................................................24 3.1. Some Characteristics of Vietnamese .......................................................................24 3.1.1. Sound................................................................................................................24 3.1.2. Syllable Structure .............................................................................................26 3.1.3. Vietnamese Word .............................................................................................26 3.2. Preprocessing and Transformation..........................................................................27 3.2.1. Sentence Segmentation.....................................................................................27 iv 3.2.2. Sentence Tokenization......................................................................................28 3.2.3. Word Segmentation ..........................................................................................28 3.2.4. Filters ................................................................................................................28 3.2.5. Remove Non Topic-Oriented Words ...............................................................28 3.3. Topic Analysis for VnExpress Dataset ...................................................................29 3.4. Topic Analysis for Vietnamese Wikipedia Dataset ................................................30 3.5. Discussion................................................................................................................31 3.6. Summary..................................................................................................................32 Chapter 4. Deployments of General Frameworks ..............................................................33 4.1. Classification with Hidden Topics ..........................................................................33 4.1.1. Classification Method.......................................................................................33 4.1.2. Experiments ......................................................................................................36 4.2. Clustering with Hidden Topics................................................................................40 4.2.1. Clustering Method ............................................................................................40 4.2.2. Experiments ......................................................................................................45 4.3. Summary..................................................................................................................49 Conclusion ..........................................................................................................................50 Achievements throughout the thesis...............................................................................50 Future Works ..................................................................................................................50 References ..........................................................................................................................52 Vietnamese References ..................................................................................................52 English References .........................................................................................................52 Appendix: Some Clustering Results...................................................................................56 v List of Figures Figure 1.1. Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization. ( [55]) ...............................................................................9 Figure 1.2. Sketch of the probability sub-simplex spanned by the aspect model ( [55])...10 Figure 1.3. Graphical model representation of LDA - The boxes is “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20] ............................................12 Figure 1.4. Generative model for Latent Dirichlet allocation; Here, Dir, Poiss and Mult stand for Dirichlet, Poisson, Multinomial distributions respectively.................................13 Figure 1.5. Quantities in the model of latent Dirichlet allocation......................................13 Figure 1.6. Gibbs sampling algorithm for Latent Dirichlet Allocation..............................16 Figure 2.1. Classification with Hidden Topics...................................................................20 Figure 2.2. Clustering with Hidden Topics ........................................................................21 Figure 3.1. Pipeline of Data Preprocessing and Transformation .......................................27 Figure 4.1. Classification with VnExpress topics .............................................................33 Figure 4.2 Combination of one snippet with its topics: an example ..................................35 Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline (without topics)...................................................................................................................37 Figure 4.4. Test-out-of train with increasing numbers of training examples. Here, the number of topics is set at 60topics .....................................................................................37 Figure 4.5 F1-Measure for classes and average (over all classes) in learning with 60 topics...................................................................................................................................39 Figure 4.6. Clustering with Hidden Topics ........................................................................40 Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering..................................42 Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query.........................47 Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query ...........................47 vi List of Tables Table 3.1. Vowels in Vietnamese.......................................................................................24 Table 3.2. Tones in Vietnamese .........................................................................................25 Table 3.3. Consonants of hanoi variety ..............................................................................26 Table 3.4. Structure of Vietnamese syllables ....................................................................26 Table 3.5. Functional words in Vietnamese .......................................................................29 Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset.........................29 Table 3.7. Statistics of VnExpress dataset .........................................................................30 Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100 topics...................................................................................................................................30 Table 3.9. Statistic of Vietnamese Wikipedia Dataset ......................................................31 Table 3.10 Most likely words for sample topics. Here, we conduct topic analysis with 200 topics...................................................................................................................................31 Table 4.1 Google search results as training and testing dataset. The search phrases for training and test data are designed to be exclusive ............................................................34 Table 4.2. Experimental results of baseline (learning without topics)...............................38 Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset.............38 Table 4.4. Some collocations with highest values of chi-square statistic ..........................44 Table 4.5. Queries submitted to Google.............................................................................45 Table 4.6. Parameters for clustering web search results ....................................................46 vii Notations & Abbreviations Word or phrase Abbreviation Information Retrieval IR Latent Semantic Analysis LSA Probability Latent Semantic Analysis PLSA Latent Dirichlet Allocation LDA Dynamic Topic Models DTM Correlated Topic Models CTM Singular Value Decomposition SVD 1 Introduction The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of the Web data is not organized in systematic and well structured forms, a situation which causes great challenges to those seeking for information on the Web. Consequently, a lot of tasks enabling users to search, navigate and organize web pages in a more effective way have been posed in the last decade, such as searching, page rank, web clustering, text classification, etc. To this end, there have been a lot of successful stories like Google, Yahoo, Open Directory Project (Dmoz), Clusty, just to name but a few. Inspired by this trend, the aim of this thesis is to develop efficient systems which are able to overcome the difficulties of dealing with sparse data. The main motivation is that while being overwhelmed by a huge amount of online data, we sometimes lack data to search or learn effectively. Let take web search clustering as an example. In order to meet the real-time condition, that is the response time must be short enough, most of online clustering systems only work with small pieces of text returned from search engines. Unfortunately those pieces are not long and rich enough to build a good clustering system. A similar situation occurs in the case of searching images only based on captions. Because image captions are only very short and sparse chunks of text, most of the current image retrieval systems still fail to achieve high accuracy. As a result, much effort has been made recently to take advantage of external resources like learning with knowledge-base support, semi-supervised learning, etc. in order to improve the accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming & labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another one even in the same domain. In the thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and another for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis that is based on recently successful topic analysis models, such as Latent Semantic Analysis, Probabilistic-Latent Semantic Analysis, or Latent Dirichlet Allocation. The underlying idea of the frameworks is that for a domain we collect a very large external data collection called “universal dataset”, and then build the learner on both the original data (like snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The general frameworks are flexible 2 and general enough to apply for a wide range of domains and languages. Once we analyze a universal dataset, the resulting hidden topics can be used for several learning tasks in the same domain. This is also particularly useful for sparse data mining. Sparse data like snippets returned from a search engine can be expanded and enriched with hidden topics. Thus, a better performance can be achieved. Moreover, because the method can learn with smaller data (the meaningful hidden topics rather than all unlabeled data), it requires less computational resources than semi-supervised learning. Roadmap: The organization of this thesis is follow Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of general framework of probabilistic modeling of text and be used to develop more sophisticated and application-oriented models, such as hierarchical models, author-role models, entity models, and so on. They can also be considered key components in our proposals in subsequent chapters. Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to apply in many domains of applications. The key common phrase between the two frameworks is topic analysis for large-scale collections of web documents. The quality of the hidden topic described in this chapter will much influence the performance of subsequent stages. Chapter 3 summarizes some major issues for analyzing data collections of Vietnamese documents/Web pages. We first review some characteristics of Vietnamese which are considered significant for data preprocessing and transformation in the subsequent processes. Next, we discuss more details about each step of preprocessing and transforming data. Important notes, including specific characteristics of Vietnamese are highlighted. Also, we demonstrate the results from topic analysis using LDA for the clean, preprocessed dataset. Chapter 4 describes the deployments of general frameworks proposed in Chapter 2 for 2 tasks: search result classification, and search result clustering. The two implementations are based on the topic model analyzed from a universal dataset like shown in chapter 3. The Conclusion sums up the achievements throughout the previous four chapters. Some future research topics are also mentioned in this section. 3 Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis 1.1. Introduction The goal of modeling text corpora and other collections of discrete data is to find short description of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basis tasks such as classification, clustering, summarization, and similarity and relevance judgments. Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). Vector space model [48] (Salton and McGill, 1983) – a methodology successfully deployed in modern search technologies - is a typical approach proposed by IR researchers for modeling text corpora. In thi