Human beings’ history has witnessed many big improvements but nothing is comparable to networks’, in general, and the WWW’s improvement, in particular.
During the last decades, databases have grown exponentially in large storesand companies. In the old days, seekers faced many difficulties in findingenough data to feed their needs. The picture has changed and now the reverse picture is a daily problem – how to find relevant data in the information which is regularly accumulated.
81 trang |
Chia sẻ: vietpd | Lượt xem: 1329 | Lượt tải: 0
Bạn đang xem trước 20 trang tài liệu Luận văn Www and the pagerank-Related problems - Nguyễn Hoài Nam, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN
Nguyễn Hoài Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
LUẬN VĂN THẠC SĨ KHOA HỌC
Hà Nội - 2006
ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN
Nguyễn Hoài Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
Chuyên ngành: Đảm bảo toán học cho máy tính và hệ thống tính toán
Mã số: 1.01.07
LUẬN VĂN THẠC SĨ KHOA HỌC
NGƯỜI HƯỚNG DẪN KHOA HỌC:
PGS. TS. HÀ QUANG THỤY
Hà Nội - 2006
HANOI NATIONAL UNIVERSITY
UNIVERSITY OF SCIENCE
Nguyen Hoai Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
Major: Mathematical assurances for computers and computing systems
Code: 1.01.07
MASTER THESIS
THESIS SUPERVISOR:
ASSOC. PROF. HA QUANG THUY
Hanoi - 2006
Intentionally left blank for your note
Contents
List of Figures ii
List of Tables iii
Introduction iv
Acknowledgement vi
Abstract ix
List of Glossaries xi
1 Objects’ ranks and applications to WWW 1
1.1 Rank of objects . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Rank for objects different from web page. . . . . . . . . . 2
1.1.2 Linkage usage in search engine . . . . . . . . . . . . . 4
1.2 Basis of PageRank . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Mathematical basis . . . . . . . . . . . . . . . . . . . 11
1.2.2 Practical issues. . . . . . . . . . . . . . . . . . . . . 13
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 19
2 Some PageRank-related problems 20
2.1 Accelerating problems . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Related works . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Exploiting block structure of the Web . . . . . . . . . . . 22
2.2 Connected-component PageRank approach . . . . . . . . . . . 30
2.2.1 Initial ideas. . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2 Mathematical basis of CCP . . . . . . . . . . . . . . . 32
2.2.3 On practical side of CCP. . . . . . . . . . . . . . . . . 35
2.3 Spam and spam detection . . . . . . . . . . . . . . . . . . . 37
2.3.1 Introduction to Web spam . . . . . . . . . . . . . . . . 37
2.3.2 Spam techniques . . . . . . . . . . . . . . . . . . . . 38
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 42
3 Implementations and experimental results 43
3.1 Search engine Nutch . . . . . . . . . . . . . . . . . . . . . 43
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1 Infrastructure and data sets . . . . . . . . . . . . . . . 48
3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . 48
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 58
Conclusions and future works 59
References 61
List of Figures
1.1 A small directed Web graph with 7 nodes . . . . . . . . . . . . . . . 13
1.2 Ranks of page in SmallWeb graph with α = .9 . . . . . . . . . . . . . 16
1.3 Figures exemplifying results with different α of SmallSet . . . . . . . 17
1.4 Graph of iterations needed with α ∈ [.5; .9] of SmallSet . . . . . . . . 18
2.1 Convergence rates for standard PageRank vs. BlockRank . . . . . . 29
2.2 Unarranged matrix and arranged matrix . . . . . . . . . . . . . . . . 31
2.3 Boosting techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Nutch packages dependencies . . . . . . . . . . . . . . . . . . . . . 46
3.2 Matrices from set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Ranks from PageRank vs. CCP for set 1 . . . . . . . . . . . . . . . . 50
3.4 Ranks and differences corresponding to each block for set 1 . . . . 51
3.5 Time of two methods with different decay values for set 1 . . . . . . 51
3.6 No. of iterations of two methods with different decay values for set 1 52
3.7 Matrices from set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 Ranks from PageRank vs. CCP for set 2 . . . . . . . . . . . . . . . . 53
3.9 Ranks and differences corresponding to each block for set 2 . . . . 53
3.10 Time of two methods with different decay values for set 2 . . . . . . 54
3.11 No. of iterations of two methods with different decay values for set 2 54
3.12 Matrices from set 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.13 Ranks from PageRank vs. CCP for set 3 . . . . . . . . . . . . . . . . 55
3.14 Ranks and differences corresponding to each block for set 3 . . . . 56
3.15 Time of two methods with different decay values for set 3 . . . . . . 56
3.16 No. of iterations of two methods with different decay values for set 3 57
i
3.17 Mean case for 3 sets with different decay values . . . . . . . . . . . 57
List of Tables
1.1 Iterations needed for each value of decay-factor α of SmallSet . . . 17
2.1 The closeness of local PageRank vector to global PageRank vector 25
3.1 Each set with corresponding sources . . . . . . . . . . . . . . . . . . 49
3.2 Each set with corresponding number of pages and links . . . . . . . 49
iii
Introduction
Human beings’ history has witnessed many big improvements but nothing is
comparable to networks’, in general, and the WWW’s improvement, in partic-
ular.
During the last decades, databases have grown exponentially in large stores
and companies. In the old days, seekers faced many difficulties in finding
enough data to feed their needs. The picture has changed and now the reverse
picture is a daily problem – how to find relevant data in the information which
is regularly accumulated. The need therefore arises for better approaches that
are able to handle complex models in a reasonable amount of time because
the old models in statistics, machine learning and traditional data analysis
fail to cope with this level of complexity. That is named data-mining with an
important branch, Web-mining.
The key inspiration of web-mining based on WWW’s growth. There is no
doubt that WWW is the most valuable source of information but it is not very
easy for you to get used to if you do not know the way. Several studies try to
estimate the size of the Web andmost of them agree that there are over billions
of pages. Additionally, the changing rate of the Web is even more dramatic [4].
According to [4], the size of the Web has doubled in less than two years, and
this growth rate is projected to continue for the next two years. Although a
enormous amount of information is available on the Web, you can hardly find
appropriate information if you nothing supporting. For that reason, you need
a search engine to make your work easier.
We can consider search engines roughly as places that store information
and return what we need after some operations. Everything returned by a
iv
search engine is sorted descendingly, that means entries which appear first
seem to be more relevant. Even if that seems normal, many efforts were made
to make this step straightforward. For various ways of approaching, there are
many methods that deal with this problem. In this problem, we have to deter-
mine a page’s importance score in order to sort it in the list of pages [16, 6].
The most famous approach discovered is PageRankTM from Google [1]. This
thesis focuses on PageRank and its modifications to tweak up the computing
process.
With the aim of understanding network phenomena and, especially, PageR-
ank computation, this thesis is organized as follow:
• Chapter 1: Demontrates on the rank concept and PageRank. This chap-
ter is the background on which the following chapters flourish.
• Chapter 2: Takes care of some PageRank-related problems such as ac-
celerating problem, spam. And, a new method is also proposed in this
chapter (section 3.2) which we has published [19].
• Chapter 3: Concentrates on the practical side of PageRank with im-
plementations and experiments. Conclusions and future works are also
mentioned in this chapter.
Acknowledgement
My first words are to acknowledge all those people who had helped, contributed
to the works in this thesis. It is obvious that I can not finish this thesis alone.
It will be an impossible task if I, myself alone, work on all things such as de-
ciding orientation, doing research, implementing... and many more things. I
tried my best, and if your name is not listed, please trust implicitly in me that
my gratitude is not less than those who are listed below.
In the foremost position of my gratitude would stand Assoc. Prof. Ha Quang
Thuy for his supervision, advice and guidance to me from the very first day
when I newly accustom science. His endless enthusiasm for science has in-
spired me in doing research and finally this thesis can be realized. Above all
and most needed, he always tries his utmost to encourage and support me in
various ways. Once again, I would like to send my most nicely thankful words
to him.
Many thanks would go to Assoc. Prof. Hoang Chi Thanh, Head of Depart-
ment of Informatics where I am working, for his help and advice to me in both
work and life. Without his help, my thesis can not be here as it is being now.
It is a pleasure for me to express my grateful thanks to Prof. Pham Ky Anh
for his generous disposition. I was allowed to use many facilities in the Center
for High Performance Computing, where he is the Director, to do research and
these facilities helped forming a big important part of my thesis.
To Doan Duy Hai, Department of Computational and Applied Mathemat-
ics, it is a great joy for me to have such a wonderful friend, colleague like him.
May be he does not know how much I am indebted to him for his kind help.
I thank him for his valuable advice in science discussion and for his precious
vi
time spending on helping me. His precise ideas were quite an unforgettable
assistance to me with my scientific works. Furthermore, with his superb skill
in LATEX, I can have help from an expert to drive the troubles away.
Working environment is a key point in forming the way and attitude of
working to me. I am very lucky to work in a good environment with wise
and scholarly colleagues. I also benefit by academic courses and professional
subjects from these wise and scholarly colleagues including my teachers and
friends. I thank Nguyen Trung Kien who, with his unconditional help of im-
plementing, running and storing data in my experiments, makes things easier
for me.
I have a nice collaboration to the seminar KDD group from Faculty of In-
formation Technology, College of Technology. I was so providential to have sev-
eral opportunities to work, discuss about science with those intellectuals. I
especially thank Nguyen Thu Trang for her help and time working on pro-
gramming with me.
This work was supported in part by the National Project "Developing con-
tent filter systems to support management and implementation public secu-
rity - ensure policy" and the MoST-203906 Project "Information Extraction
Models for discovering entities and semantic relations from Vietnamese Web
pages".
I can not wait anymore to wholeheartedly acknowledge my sweetheart. She
is really an angel whose dedication, love, confidence in me and intelligence
really took me over hard works.
I am extremely lucky to be born in my family. My father, the first person
who set the basis of my learning character, taught me the joy of intellectual
recreation even when I was a child. My mother, the best mother in the world,
with her unselfish sacrifice, bring me up with gently love. And, my younger
brother, without his spiritual support, may I fail many times.
My last words are to thank to everybody because you deserve that for your
importance to me.
Hanoi, Octorber 25th, 2006
Nguyen Hoai Nam
Abstract
In this thesis, we will study about the whole WWW on the point of view of
social networks. There are many directions to come up to WWW but we chose
the way of social networks because it will give us many advantages in doing
research.
Everytime we heard of WWW, that refer us to the Internet - the biggest
network all over the world. Networks are all around us, all the time. From the
biochemistry of our cells to the web of friendships across the planet. From the
circuitry of modern electronics to chains of historical events. A network is the
result of the forces that shaped it. Thus the principles of network formation can
be, to some extent, deciphered from the network itself. All such information
comprises the structure of the network.
Moreover, the Internet has been improving rapidly, from a small network
for experimental purpose to a world-wide network. The plentiful content of the
World-Wide Web is useful to millions. But infomation is only available if you
know where to find it. This thesis centres around some aspects of network-
related problems:
• How can we find needed infomation? The answer is the appearance of
search engines.
• How is returned information sorted? The key idea is to use a ranking ap-
proach and the most famous approach, PageRankTM, is carefully studied
in this thesis.
• What are raised problems associated to the answers of two previous ques-
tions? These problems cover storage capacity, computational complexity,
ix
accelerating methods, programming skills... and many more.
To check whether the theory to tweak up the existing method is true, we
implemented with data sets downloaded from the Internet. Experimental fig-
ures show that our method is relatively good and it is really promissing.
List of Glossaries
Glossary name Description
CCP Connected-component PageRank
IR Information Retrieval
PR PageRank
URL Unified Resource Locator
WWW World Wide Web
xi
CHAPTER
Objects’ ranks and applications to
WWW
WWW is a very famous entity that you can meet everyday at any
time and in any circumstance. Gradually it prevails on every aspect
of our lives and it is definitely useful. Because of endlessly enormous
information, which can be provided by the WWW, we must have a
scheme to sort things in order to determine which is relatively suit-
able. This chapter will guide you through one of the most topical
problems addressed.
1.1 Rank of objects
Many datasets of interest today are best described as a linked collection of
interrelated objects. These may represent homogeneous networks, in which
there is a single-object type and link type, or richer, heterogeneous networks,
in which there may be multiple object and link types (and possibly other se-
mantic information). Examples of homogeneous networks include single mode
social networks, such as people connected by friendship links, or the WWW,
1
1.1 Rank of objects
a collection of linked web pages. Examples of heterogeneous networks include
those in medical domains describing patients, diseases, treatments and con-
tacts, or in bibliographic domains describing publications, authors, and venues.
These examples may include set of scientific papers. As all of us know, it is
necessary that we credit others’ works if we use their works in our paper. That
can be considered a link between our papers and credit-owner’s works. Link
mining refers to data mining techniques that explicitly consider these links
when building predictive or descriptive models of the linked data. Commonly
addressed link mining tasks include object ranking, group detection, collective
classification, link prediction and subgraph discovery. While network analysis
has been studied in depth in particular areas such as social network analy-
sis, hypertext mining, and web analysis, only recently has there been a cross-
fertilization of ideas among these different communities. This is an exciting,
rapidly expanding area.
In this section, examples on ranking objects in many kinds of datasets are
examined to insist that an approach to sort relevant results is required. Then,
the section will go on with the need of a ranking algorithm for WWW–one of
the most important communities in our contemporary world.
1.1.1 Rank for objects different from web page
Progress in digital data acquisition and storage technology has resulted in
the growth of huge databases. This has occurred in all areas of human en-
deavor, from the mundane (such as supermarket transaction data, credit card
usage records, telephone call details, and government statistics) to the more
exotic (such as images of astronomical bodies, molecular databases, and med-
ical records). Little wonder, then, that interest has grown in the possibility
of tapping these data, of extracting from them information that might be of
value to the owner of the database. The discipline concerned with this task
has become known as data mining.
Data mining, with its aim to find knowledge from a "big mountain" of infor-
mation, needs a way to sort things extracted. We will study some examples to
decide that a ranking algorithm is "must-have" part of every mining system.
2
1.1 Rank of objects
Example 1.1. Consider some online computers store that accept queries on
the Configuration and Price attributes of a goods. Agents could treat query
conditions as if they were regular Boolean conditions then returns relevant
things. This way, agent or any client can determine an acceptable radius of
pattern and price around the preferred products. However, suppose we are
browsing a very big store, there could be too many matching items making
users’ tasks difficult. Additionally, many items that are slightly weaker in
specification but price are very good in the searching range are missing.
Thus, we should think of a new ranking system which ranks results if they
are closest to the Configuration given and have best Price or relatively inex-
pensive.
In addition, we can propose, briefly, a method to rank items as follow.
Example 1.2. Suppose, with two criteria Configuration and Price used, each
criterion is assigned weight as in 0.8c+0.2p where c ∈ [0, 1] shows how close the
result is to the target configuration and p ∈ [0, 1] indicates how close the price
is to the target price. Looking at the formula, if a computer has higher relevant
configuration, it seems to be ranked higher in results. If another store would
want to weigh configuration and price equally, the formula should be 0.5c+0.5p.
Suppose that a customer wants to look for a computer with specific con-
figuration and target price of 500US$. Besides, in the store, there is only one
computer with the specific configuration (c = 1, quite good) and high price
(i.e. p = 0.2). All the remaining computers are in the near configuration with
c = 0.8 and the price for all is moderate, p = 0.5. Note that, with the same or
near configuration, computers can differ in price due to the make of manufac-
turers.
Applying the definition above, we compute the relevance score of each com-
puter type. The computer, which suites best in configuration have score of
0.8 × 1 + 0.2 × 0.2 = 0.84 whereas other computers should have scores of
0.8×0.8+0.2×0.5 = 0.74. Therefore, in this store, the computer with best rele-
vant configuration should be ranked first in results and others will have lower
ranks. What will happen if we consider a store which assigns the same weight
to both configuration and price? Let us have a computation of these computers.
The computer, which suites best in configuration has score of 0.5×1+0.5×0.2 =
3
1.1 Rank of objects
0.6 whereas other computers should have scores of 0.5 × 0.8 + 0.5 × 0.5 = 0.65.
Therefore, in this store, the answer is different. Computers with unlike config-
uration but good price should be the first choice because they have high scores
and the computer with the same configuration but high price would receive a
worse vote from the store’s algorithm.
Problem arises in Example 1.2 is not the only problem. We will face another
problem which is described Example 1.3
Example 1.3. Suppose that there is a service, which combines results from
many online stores to provide information to customers. Therefore, this ser-
vice has to query many stores at once then combines the results into