Luận văn Www and the pagerank-Related problems - Nguyễn Hoài Nam

Human beings’ history has witnessed many big improvements but nothing is comparable to networks’, in general, and the WWW’s improvement, in particular. During the last decades, databases have grown exponentially in large storesand companies. In the old days, seekers faced many difficulties in findingenough data to feed their needs. The picture has changed and now the reverse picture is a daily problem – how to find relevant data in the information which is regularly accumulated.

pdf81 trang | Chia sẻ: vietpd | Lượt xem: 1326 | Lượt tải: 0download
Bạn đang xem trước 20 trang tài liệu Luận văn Www and the pagerank-Related problems - Nguyễn Hoài Nam, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN Nguyễn Hoài Nam WWW AND THE PAGERANK-RELATED PROBLEMS LUẬN VĂN THẠC SĨ KHOA HỌC Hà Nội - 2006 ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN Nguyễn Hoài Nam WWW AND THE PAGERANK-RELATED PROBLEMS Chuyên ngành: Đảm bảo toán học cho máy tính và hệ thống tính toán Mã số: 1.01.07 LUẬN VĂN THẠC SĨ KHOA HỌC NGƯỜI HƯỚNG DẪN KHOA HỌC: PGS. TS. HÀ QUANG THỤY Hà Nội - 2006 HANOI NATIONAL UNIVERSITY UNIVERSITY OF SCIENCE Nguyen Hoai Nam WWW AND THE PAGERANK-RELATED PROBLEMS Major: Mathematical assurances for computers and computing systems Code: 1.01.07 MASTER THESIS THESIS SUPERVISOR: ASSOC. PROF. HA QUANG THUY Hanoi - 2006 Intentionally left blank for your note Contents List of Figures ii List of Tables iii Introduction iv Acknowledgement vi Abstract ix List of Glossaries xi 1 Objects’ ranks and applications to WWW 1 1.1 Rank of objects . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Rank for objects different from web page. . . . . . . . . . 2 1.1.2 Linkage usage in search engine . . . . . . . . . . . . . 4 1.2 Basis of PageRank . . . . . . . . . . . . . . . . . . . . . . 10 1.2.1 Mathematical basis . . . . . . . . . . . . . . . . . . . 11 1.2.2 Practical issues. . . . . . . . . . . . . . . . . . . . . 13 Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 19 2 Some PageRank-related problems 20 2.1 Accelerating problems . . . . . . . . . . . . . . . . . . . . . 20 2.1.1 Related works . . . . . . . . . . . . . . . . . . . . . 21 2.1.2 Exploiting block structure of the Web . . . . . . . . . . . 22 2.2 Connected-component PageRank approach . . . . . . . . . . . 30 2.2.1 Initial ideas. . . . . . . . . . . . . . . . . . . . . . . 30 2.2.2 Mathematical basis of CCP . . . . . . . . . . . . . . . 32 2.2.3 On practical side of CCP. . . . . . . . . . . . . . . . . 35 2.3 Spam and spam detection . . . . . . . . . . . . . . . . . . . 37 2.3.1 Introduction to Web spam . . . . . . . . . . . . . . . . 37 2.3.2 Spam techniques . . . . . . . . . . . . . . . . . . . . 38 Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 42 3 Implementations and experimental results 43 3.1 Search engine Nutch . . . . . . . . . . . . . . . . . . . . . 43 3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.1 Infrastructure and data sets . . . . . . . . . . . . . . . 48 3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . 48 Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 58 Conclusions and future works 59 References 61 List of Figures 1.1 A small directed Web graph with 7 nodes . . . . . . . . . . . . . . . 13 1.2 Ranks of page in SmallWeb graph with α = .9 . . . . . . . . . . . . . 16 1.3 Figures exemplifying results with different α of SmallSet . . . . . . . 17 1.4 Graph of iterations needed with α ∈ [.5; .9] of SmallSet . . . . . . . . 18 2.1 Convergence rates for standard PageRank vs. BlockRank . . . . . . 29 2.2 Unarranged matrix and arranged matrix . . . . . . . . . . . . . . . . 31 2.3 Boosting techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.1 Nutch packages dependencies . . . . . . . . . . . . . . . . . . . . . 46 3.2 Matrices from set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.3 Ranks from PageRank vs. CCP for set 1 . . . . . . . . . . . . . . . . 50 3.4 Ranks and differences corresponding to each block for set 1 . . . . 51 3.5 Time of two methods with different decay values for set 1 . . . . . . 51 3.6 No. of iterations of two methods with different decay values for set 1 52 3.7 Matrices from set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.8 Ranks from PageRank vs. CCP for set 2 . . . . . . . . . . . . . . . . 53 3.9 Ranks and differences corresponding to each block for set 2 . . . . 53 3.10 Time of two methods with different decay values for set 2 . . . . . . 54 3.11 No. of iterations of two methods with different decay values for set 2 54 3.12 Matrices from set 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.13 Ranks from PageRank vs. CCP for set 3 . . . . . . . . . . . . . . . . 55 3.14 Ranks and differences corresponding to each block for set 3 . . . . 56 3.15 Time of two methods with different decay values for set 3 . . . . . . 56 3.16 No. of iterations of two methods with different decay values for set 3 57 i 3.17 Mean case for 3 sets with different decay values . . . . . . . . . . . 57 List of Tables 1.1 Iterations needed for each value of decay-factor α of SmallSet . . . 17 2.1 The closeness of local PageRank vector to global PageRank vector 25 3.1 Each set with corresponding sources . . . . . . . . . . . . . . . . . . 49 3.2 Each set with corresponding number of pages and links . . . . . . . 49 iii Introduction Human beings’ history has witnessed many big improvements but nothing is comparable to networks’, in general, and the WWW’s improvement, in partic- ular. During the last decades, databases have grown exponentially in large stores and companies. In the old days, seekers faced many difficulties in finding enough data to feed their needs. The picture has changed and now the reverse picture is a daily problem – how to find relevant data in the information which is regularly accumulated. The need therefore arises for better approaches that are able to handle complex models in a reasonable amount of time because the old models in statistics, machine learning and traditional data analysis fail to cope with this level of complexity. That is named data-mining with an important branch, Web-mining. The key inspiration of web-mining based on WWW’s growth. There is no doubt that WWW is the most valuable source of information but it is not very easy for you to get used to if you do not know the way. Several studies try to estimate the size of the Web andmost of them agree that there are over billions of pages. Additionally, the changing rate of the Web is even more dramatic [4]. According to [4], the size of the Web has doubled in less than two years, and this growth rate is projected to continue for the next two years. Although a enormous amount of information is available on the Web, you can hardly find appropriate information if you nothing supporting. For that reason, you need a search engine to make your work easier. We can consider search engines roughly as places that store information and return what we need after some operations. Everything returned by a iv search engine is sorted descendingly, that means entries which appear first seem to be more relevant. Even if that seems normal, many efforts were made to make this step straightforward. For various ways of approaching, there are many methods that deal with this problem. In this problem, we have to deter- mine a page’s importance score in order to sort it in the list of pages [16, 6]. The most famous approach discovered is PageRankTM from Google [1]. This thesis focuses on PageRank and its modifications to tweak up the computing process. With the aim of understanding network phenomena and, especially, PageR- ank computation, this thesis is organized as follow: • Chapter 1: Demontrates on the rank concept and PageRank. This chap- ter is the background on which the following chapters flourish. • Chapter 2: Takes care of some PageRank-related problems such as ac- celerating problem, spam. And, a new method is also proposed in this chapter (section 3.2) which we has published [19]. • Chapter 3: Concentrates on the practical side of PageRank with im- plementations and experiments. Conclusions and future works are also mentioned in this chapter. Acknowledgement My first words are to acknowledge all those people who had helped, contributed to the works in this thesis. It is obvious that I can not finish this thesis alone. It will be an impossible task if I, myself alone, work on all things such as de- ciding orientation, doing research, implementing... and many more things. I tried my best, and if your name is not listed, please trust implicitly in me that my gratitude is not less than those who are listed below. In the foremost position of my gratitude would stand Assoc. Prof. Ha Quang Thuy for his supervision, advice and guidance to me from the very first day when I newly accustom science. His endless enthusiasm for science has in- spired me in doing research and finally this thesis can be realized. Above all and most needed, he always tries his utmost to encourage and support me in various ways. Once again, I would like to send my most nicely thankful words to him. Many thanks would go to Assoc. Prof. Hoang Chi Thanh, Head of Depart- ment of Informatics where I am working, for his help and advice to me in both work and life. Without his help, my thesis can not be here as it is being now. It is a pleasure for me to express my grateful thanks to Prof. Pham Ky Anh for his generous disposition. I was allowed to use many facilities in the Center for High Performance Computing, where he is the Director, to do research and these facilities helped forming a big important part of my thesis. To Doan Duy Hai, Department of Computational and Applied Mathemat- ics, it is a great joy for me to have such a wonderful friend, colleague like him. May be he does not know how much I am indebted to him for his kind help. I thank him for his valuable advice in science discussion and for his precious vi time spending on helping me. His precise ideas were quite an unforgettable assistance to me with my scientific works. Furthermore, with his superb skill in LATEX, I can have help from an expert to drive the troubles away. Working environment is a key point in forming the way and attitude of working to me. I am very lucky to work in a good environment with wise and scholarly colleagues. I also benefit by academic courses and professional subjects from these wise and scholarly colleagues including my teachers and friends. I thank Nguyen Trung Kien who, with his unconditional help of im- plementing, running and storing data in my experiments, makes things easier for me. I have a nice collaboration to the seminar KDD group from Faculty of In- formation Technology, College of Technology. I was so providential to have sev- eral opportunities to work, discuss about science with those intellectuals. I especially thank Nguyen Thu Trang for her help and time working on pro- gramming with me. This work was supported in part by the National Project "Developing con- tent filter systems to support management and implementation public secu- rity - ensure policy" and the MoST-203906 Project "Information Extraction Models for discovering entities and semantic relations from Vietnamese Web pages". I can not wait anymore to wholeheartedly acknowledge my sweetheart. She is really an angel whose dedication, love, confidence in me and intelligence really took me over hard works. I am extremely lucky to be born in my family. My father, the first person who set the basis of my learning character, taught me the joy of intellectual recreation even when I was a child. My mother, the best mother in the world, with her unselfish sacrifice, bring me up with gently love. And, my younger brother, without his spiritual support, may I fail many times. My last words are to thank to everybody because you deserve that for your importance to me. Hanoi, Octorber 25th, 2006 Nguyen Hoai Nam Abstract In this thesis, we will study about the whole WWW on the point of view of social networks. There are many directions to come up to WWW but we chose the way of social networks because it will give us many advantages in doing research. Everytime we heard of WWW, that refer us to the Internet - the biggest network all over the world. Networks are all around us, all the time. From the biochemistry of our cells to the web of friendships across the planet. From the circuitry of modern electronics to chains of historical events. A network is the result of the forces that shaped it. Thus the principles of network formation can be, to some extent, deciphered from the network itself. All such information comprises the structure of the network. Moreover, the Internet has been improving rapidly, from a small network for experimental purpose to a world-wide network. The plentiful content of the World-Wide Web is useful to millions. But infomation is only available if you know where to find it. This thesis centres around some aspects of network- related problems: • How can we find needed infomation? The answer is the appearance of search engines. • How is returned information sorted? The key idea is to use a ranking ap- proach and the most famous approach, PageRankTM, is carefully studied in this thesis. • What are raised problems associated to the answers of two previous ques- tions? These problems cover storage capacity, computational complexity, ix accelerating methods, programming skills... and many more. To check whether the theory to tweak up the existing method is true, we implemented with data sets downloaded from the Internet. Experimental fig- ures show that our method is relatively good and it is really promissing. List of Glossaries Glossary name Description CCP Connected-component PageRank IR Information Retrieval PR PageRank URL Unified Resource Locator WWW World Wide Web xi CHAPTER Objects’ ranks and applications to WWW WWW is a very famous entity that you can meet everyday at any time and in any circumstance. Gradually it prevails on every aspect of our lives and it is definitely useful. Because of endlessly enormous information, which can be provided by the WWW, we must have a scheme to sort things in order to determine which is relatively suit- able. This chapter will guide you through one of the most topical problems addressed. 1.1 Rank of objects Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single-object type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other se- mantic information). Examples of homogeneous networks include single mode social networks, such as people connected by friendship links, or the WWW, 1 1.1 Rank of objects a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments and con- tacts, or in bibliographic domains describing publications, authors, and venues. These examples may include set of scientific papers. As all of us know, it is necessary that we credit others’ works if we use their works in our paper. That can be considered a link between our papers and credit-owner’s works. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analy- sis, hypertext mining, and web analysis, only recently has there been a cross- fertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this section, examples on ranking objects in many kinds of datasets are examined to insist that an approach to sort relevant results is required. Then, the section will go on with the need of a ranking algorithm for WWW–one of the most important communities in our contemporary world. 1.1.1 Rank for objects different from web page Progress in digital data acquisition and storage technology has resulted in the growth of huge databases. This has occurred in all areas of human en- deavor, from the mundane (such as supermarket transaction data, credit card usage records, telephone call details, and government statistics) to the more exotic (such as images of astronomical bodies, molecular databases, and med- ical records). Little wonder, then, that interest has grown in the possibility of tapping these data, of extracting from them information that might be of value to the owner of the database. The discipline concerned with this task has become known as data mining. Data mining, with its aim to find knowledge from a "big mountain" of infor- mation, needs a way to sort things extracted. We will study some examples to decide that a ranking algorithm is "must-have" part of every mining system. 2 1.1 Rank of objects Example 1.1. Consider some online computers store that accept queries on the Configuration and Price attributes of a goods. Agents could treat query conditions as if they were regular Boolean conditions then returns relevant things. This way, agent or any client can determine an acceptable radius of pattern and price around the preferred products. However, suppose we are browsing a very big store, there could be too many matching items making users’ tasks difficult. Additionally, many items that are slightly weaker in specification but price are very good in the searching range are missing. Thus, we should think of a new ranking system which ranks results if they are closest to the Configuration given and have best Price or relatively inex- pensive.  In addition, we can propose, briefly, a method to rank items as follow. Example 1.2. Suppose, with two criteria Configuration and Price used, each criterion is assigned weight as in 0.8c+0.2p where c ∈ [0, 1] shows how close the result is to the target configuration and p ∈ [0, 1] indicates how close the price is to the target price. Looking at the formula, if a computer has higher relevant configuration, it seems to be ranked higher in results. If another store would want to weigh configuration and price equally, the formula should be 0.5c+0.5p. Suppose that a customer wants to look for a computer with specific con- figuration and target price of 500US$. Besides, in the store, there is only one computer with the specific configuration (c = 1, quite good) and high price (i.e. p = 0.2). All the remaining computers are in the near configuration with c = 0.8 and the price for all is moderate, p = 0.5. Note that, with the same or near configuration, computers can differ in price due to the make of manufac- turers. Applying the definition above, we compute the relevance score of each com- puter type. The computer, which suites best in configuration have score of 0.8 × 1 + 0.2 × 0.2 = 0.84 whereas other computers should have scores of 0.8×0.8+0.2×0.5 = 0.74. Therefore, in this store, the computer with best rele- vant configuration should be ranked first in results and others will have lower ranks. What will happen if we consider a store which assigns the same weight to both configuration and price? Let us have a computation of these computers. The computer, which suites best in configuration has score of 0.5×1+0.5×0.2 = 3 1.1 Rank of objects 0.6 whereas other computers should have scores of 0.5 × 0.8 + 0.5 × 0.5 = 0.65. Therefore, in this store, the answer is different. Computers with unlike config- uration but good price should be the first choice because they have high scores and the computer with the same configuration but high price would receive a worse vote from the store’s algorithm.  Problem arises in Example 1.2 is not the only problem. We will face another problem which is described Example 1.3 Example 1.3. Suppose that there is a service, which combines results from many online stores to provide information to customers. Therefore, this ser- vice has to query many stores at once then combines the results into