Thesis: Social network analysis - Nguyễn Hữu Bình Minh
Abstract
Social network analysis is one of the most active research topics today. It
has been widely used in domains such as sociology, biology, economics, and
information science.
From the very beginning, researchers have used the concept of centrality to
analyze networks. In 1948, Bavelas [14] proposed the idea of centrality as applied
to human communication. He was specifically concerned with communication in
small groups and hypothesized a relationship between structural centrality and
influence in group processes.
For years, it has been agreed that centrality is an important structural factor
of social networks, and many measures of centrality have been proposed,
including four widely used measures: degree centrality, betweenness centrality,
closeness centrality, and eigenvector centrality [34].
The Web is an example of a social network: references from page to page
create the hyperlink structure of the Internet. The most interesting application of
analyzing this network is the information retrieval system (or search engine). After
crawling web pages to a local store, we create a network based on the links
between the pages and then compute the quality of each page, which is called its
static rank. The static rank helps information retrieval systems return more
relevant results for a query. PageRank and HITS are the two most widely used
algorithms for calculating the static rank in today's search engines.
In addition, social networking sites, also referred to here as blogs, have
become more and more popular. These sites have their own properties that
challenge traditional search engines in some contexts, such as users searching for
users, where we have to find all users with the shortest paths to the user
issuing the query [23]. It is also possible to apply PageRank to blog search, but
with some modifications to fit the properties of blogs.
Recently, several local search engines have appeared in Vietnam, including
xalo, 7sac, baamboo, socbay, and headvances, among others, but only three of them
(xalo, baamboo, and headvances) offer blog search, and none uses a link-based
ranking algorithm to improve its ranking.
We consider that there is a link between two bloggers if one of them has left a
comment on the other's blog. More precisely, we model these relations as a network
whose nodes are bloggers and whose ties are “commenting” relations. If blogger A
left n comments on blogger B's blog, we construct two corresponding nodes A and B,
and a directed tie from A to B with weight n. We have modified the PageRank
algorithm to take tie weights into account, which calculates the static rank of
each blogger more precisely.
Acknowledgement
I would like to thank my supervisors, Assoc. Prof. Dr Ha Quang Thuy and
Ms. Nguyen Thu Trang at College of Technology, VNUH, for all their
understanding, support, and encouragement, which helped me finish this thesis.
I also want to thank my colleagues at Tinh Van Media for all their help,
especially Mr. Pham Thuc Truong Luong and Mr. Nguyen Quan Son for allowing
me to run experiments on their search platform.
Finally, I thank my dear friends, who were always beside me,
encouraged me, and spent time proofreading the manuscript.
Contents
Abstract
Acknowledgement
List of Figures
Chapter 1
Introduction to Social Network
1. Social network
2. Network construction
3. Network representation
4. A brief introduction of graph theory
5. Social network’s characteristics
6. Social network analysis – SNA
Chapter 2
Ranking in social network – Social rank
1. Introduction
2. Ranking in social networks
Chapter 3
Ranking bloggers and Experiments
1. Background and Motivation
2. Ranking bloggers by PageRank
3. Experiment setup and Results
Conclusion and Future works
Bibliography
List of Figures
Figure 1: A symmetric relationship
Figure 2: A directional relationship
Figure 3: Internet Alliances
Figure 4: A socio-gram
Figure 5: Graph and adjacency matrix
Figure 6: Six degrees of separation
Figure 7: Real world example of small world networks
Figure 8: The Kite Network
Figure 9: An example showing how PageRank works
Figure 10: Đầu gấu’s blog
Figure 11: The corresponding network of Đầu gấu’s blog
Figure 12: Blog Ranking Architecture
Figure 13: A part of the Yahoo 360 network
Figure 14: Top 10 bloggers based on number of comments
Figure 15: Top 10 bloggers based on PageRank
Chapter 1
Introduction to Social Network
1. Social network
A social network is a social structure made of nodes and ties, where nodes
might be people, groups, or organizations, and ties might be relations, flows, or
exchanges between the nodes [33].
In its simplest form, a network contains two nodes and one relationship
that connects them [12]. The context might be people studying at the same
university. As Figure 1 shows, Minh and Thu have a relationship because they study
in the same class at university, so in this kind of network there is a tie between
the two nodes Minh and Thu.
Figure 1: A symmetric relationship
The previous network is undirected, or symmetric: A knows B and B knows A
as well. Such relationships include friendship, neighborhood, kinship,
companionship, or just living in the same room. In reality, however, many
relationships are directional, such as financial exchange, liking (or disliking),
information flow, or disease transmission. For instance, Minh likes Thu, but Thu
might not like Minh.
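Figure 2: A directional relationship.
[Figure 1 shows the nodes Minh and Thu joined by an undirected tie labeled “studying at the same university”; Figure 2 shows a directed tie labeled “likes” from Minh to Thu.]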
More complex networks are multi-relational. These networks model
many kinds of relationships between objects, so there might be several different
ties between the same two nodes [12].
Relationships might be more than sharing some attributes or being at the
same place at the same time; a flow between objects can also form a relationship.
Liking, for example, might lead to an exchange of gifts. In an organization,
knowledge flows between people: they share information and experiences, and
this flow constitutes a network [12].
A tie might have a weight associated with it, expressing the strength of the
relationship between the two objects. A long-standing friendship should be stronger
than the friendship with someone you have just said “hi” to in the street.
A social network need not be social in context. There are many real-world
instances of technological, business, economic, and biological social
networks, such as electrical power grids, telephone call graphs, the World Wide
Web, co-authorship and citation networks of scientists, the spread of computer
viruses, or the water flow network in a city. The exchange of emails within
organizations, newsgroups, chat rooms, and friendships are examples from sociology
[16].
Figure 3: Internet Alliances
Source:
2. Network construction
Given a set of nodes, there are several strategies for collecting information
(objects and relations) and creating a network. At one end are full
network methods, which yield the maximum amount of information but can be
costly and difficult to execute, and whose results may be difficult to generalize.
At the other end are approaches that yield considerably less information about the
network structure but are often less costly and often generalize more easily from
the observations in the sample to some larger population. There is no one
right method for all research questions and problems; each method has its own
advantages and disadvantages.
In this section, I give an overview of four major methods used in practice;
refer to [29] for more details.
2.1.1. Full network methods
This approach begins with a set of actors and tries to collect information
(relations or ties) between each actor and all other actors. For example, we could
collect friendship data from all pairs of students in a college, count the number
of vehicles moving between all pairs of cities, or look at the flow of email
between all pairs of employees in an organization.
Because we collect information between all pairs of actors, full network
methods draw a complete picture of relations in the population. Full network
data is needed to properly define and measure many structural concepts of
network analysis. The disadvantage of this approach is the cost of collecting
the information; the process is very expensive.
2.1.2. Snowball methods
In these methods, we choose a set of actors as a starting point. We then
include the other actors who have connections with each actor in the set. The
process continues until no new actors are identified, or until we decide to stop.
Isolated actors are not located by this method, and the structure of the
network depends greatly on how we choose the initial actors.
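The snowball procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a dictionary from each actor to the contacts that actor names), the function name `snowball_sample`, the `max_waves` parameter, and the actors Lan and Hoa are all hypothetical and do not come from the thesis.

```python
def snowball_sample(contacts_of, seeds, max_waves=None):
    """Collect actors wave by wave, starting from the seed actors.

    contacts_of: dict mapping an actor to the actors it names as contacts
                 (an assumed input format for this sketch).
    max_waves:   stop after this many waves; None means continue until
                 no new actors are identified.
    """
    found = set(seeds)
    frontier = list(seeds)
    wave = 0
    while frontier and (max_waves is None or wave < max_waves):
        next_frontier = []
        for actor in frontier:
            for contact in contacts_of.get(actor, ()):
                if contact not in found:
                    found.add(contact)
                    next_frontier.append(contact)
        frontier = next_frontier
        wave += 1
    return found


contacts = {
    "Minh": ["Thu"],
    "Thu": ["Minh", "Lan"],
    "Lan": ["Thu"],
    "Hoa": [],  # isolated actor: nobody names Hoa
}
full_sample = snowball_sample(contacts, ["Minh"])
one_wave = snowball_sample(contacts, ["Minh"], max_waves=1)
print(sorted(full_sample))
print(sorted(one_wave))
```

In this toy data the isolated actor Hoa is never reached from the seed Minh, which illustrates why the result depends on the choice of initial actors.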
2.1.3. Ego-centric networks (with alter connections)
In many cases, it is neither feasible nor necessary to track down the full
network beginning with some initial nodes, as in the snowball method. Instead, we
can begin with a set of initial nodes and identify the nodes that have connections
with them. Then we determine which of the nodes identified in the
first stage are connected to one another.
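The two stages can be sketched as follows for a single ego and undirected ties. The function name `ego_network`, the pair-list input format, and the sample actors are assumptions made for illustration only.

```python
def ego_network(ties, ego):
    """Return (alters, kept_ties) for one ego in an undirected network.

    Stage 1: the alters are the nodes directly tied to the ego.
    Stage 2: keep only ties whose endpoints are both the ego or an alter,
             which includes the ties among the alters themselves.
    """
    tie_sets = [frozenset(tie) for tie in ties]
    alters = set()
    for tie in tie_sets:
        if ego in tie:
            alters |= tie - {ego}
    members = alters | {ego}
    kept_ties = {tie for tie in tie_sets if tie <= members}
    return alters, kept_ties


sample_ties = [("Minh", "Thu"), ("Minh", "Lan"), ("Thu", "Lan"), ("Lan", "Hoa")]
alters, kept_ties = ego_network(sample_ties, "Minh")
print(sorted(alters))
print(sorted(sorted(tie) for tie in kept_ties))
```

The tie between Lan and Hoa is dropped because Hoa is not directly connected to the ego, while the tie between the two alters Thu and Lan is kept.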
2.1.4. Ego-centric networks (ego only)
Ego-centric methods focus on the individual rather than on the
network as a whole. These methods collect information on the connections
among the actors connected to each focal ego, which still presents a fairly good
picture of the “local” networks, or “neighborhoods”, of individuals. Such
information is useful for understanding how networks affect individuals.
3. Network representation
In order to analyze a social network, we need a way to represent it in a
computational structure and to see what it looks like. Network analysis uses
graphs and adjacency matrices to model social networks, and uses graph theory to
analyze them.
Graphs are a very useful way to present information about social networks.
In simple networks, it is easy to look at the graph and recognize patterns of
information. Network analysis uses one kind of graphic display that consists of
points to represent objects or nodes, and lines to represent ties or relations. This
graphic is called a socio-gram. Socio-grams use various colors, shapes, names, etc.
to represent different actors and relations [29].
Figure 4: A socio-gram
Source:
In more complex networks, with thousands of actors and many
different kinds of relations, graphs (socio-grams) can become so visually
complicated that it is difficult to see patterns. In this situation, we can represent
information about social networks in the form of matrices. This approach allows
the application of mathematical and computer tools to summarize and find
patterns [29].
The most common form of matrix in social network analysis is the adjacency
matrix, a square matrix with as many rows and columns as there are actors in the
network. The weights or scores in the cells of the matrix show information about
the ties between each pair of actors. This kind of matrix represents who is next to,
or adjacent to, whom in the “social space” mapped by the relations that we have
measured [29].
Figure 5: Graph (right) and adjacency matrix (left)
Source: [25]
Formally, we represent a network as a graph G = (V, E) consisting of a set of
vertices V = {v_i} that represent social entities and a set of edges E = {e_ij}, where
e_ij represents information about the connection between the nodes i and j [25].
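As a minimal sketch of the matrix representation, the following Python code builds an adjacency matrix from a list of weighted directed ties. The convention that row i, column j holds the weight of the tie from node i to node j, and the function name `adjacency_matrix`, are assumptions of this example rather than definitions from the cited sources.

```python
def adjacency_matrix(nodes, weighted_ties):
    """Build a square matrix with one row and one column per node.

    Cell [i][j] holds the weight of the tie from nodes[i] to nodes[j],
    or 0 when there is no such tie (a convention assumed in this sketch).
    """
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target, weight in weighted_ties:
        matrix[index[source]][index[target]] = weight
    return matrix


nodes = ["A", "B", "C"]
ties = [("A", "B", 3), ("B", "A", 1), ("C", "A", 2)]
matrix = adjacency_matrix(nodes, ties)
for row in matrix:
    print(row)
# [0, 3, 0]
# [1, 0, 0]
# [2, 0, 0]
```

Because the ties are directed, the matrix is not symmetric: the tie from A to B has weight 3, while the tie from B to A has weight 1.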
4. A brief introduction of graph theory
Graph theory is a prerequisite for social network analysis. Because social
networks can be represented as graphs, understanding the fundamental concepts of
graph theory is essential. In this section we give some concepts that are
often used when analyzing networks. More details can be found in [29].
The degree of a node is defined as the number of ties incident upon that node.
In a directed graph, each node has both an indegree and an outdegree. The indegree
is the number of ties pointing to the node, whereas the outdegree is the number of
ties pointing out from that node.
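For illustration, indegree and outdegree can be counted directly from a list of directed ties. The function name `degrees` and the sample ties are hypothetical.

```python
from collections import Counter


def degrees(directed_ties):
    """Return (indegree, outdegree) counters for a list of (source, target) ties."""
    indegree, outdegree = Counter(), Counter()
    for source, target in directed_ties:
        outdegree[source] += 1  # the tie points out from the source
        indegree[target] += 1   # the tie points to the target
    return indegree, outdegree


indegree, outdegree = degrees([("A", "B"), ("A", "C"), ("B", "C")])
print(indegree["C"], outdegree["A"], indegree["A"])  # 2 2 0
```

A `Counter` returns 0 for nodes it has never seen, so a node with no incoming ties reports an indegree of 0 without special handling.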
A path is an alternating sequence of nodes and ties, beginning at a node and
ending at a node, and which does not visit any node more than once.
A walk is like a path except that there is no restriction on the number of times
a point can be visited. A path is a kind of walk.
A cycle is just like a path except that it starts and ends at the same point.
The length of a path or walk (or cycle) is defined as the number of ties in it.
A path between two nodes with the shortest length is called a shortest path
(also a geodesic) between the two nodes. It is not always unique (that is, there
may be several paths between the same two points that are equally short). The
graph-theoretic distance between two nodes is defined as the length of the shortest
path between them.
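The graph-theoretic distance in an unweighted graph can be computed with a breadth-first search. The following is a minimal sketch; the neighbor-dictionary input format and the function name `distance` are assumptions of this example.

```python
from collections import deque


def distance(neighbors, start, goal):
    """Length of a shortest path from start to goal, or None if unreachable.

    neighbors: dict mapping each node to the nodes it is tied to.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:  # first visit is along a shortest path
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
    "E": [],
}
print(distance(graph, "A", "D"))  # 2
print(distance(graph, "A", "E"))  # None
```

In this example there are two geodesics from A to D (through B and through C), both of length 2, which shows that a shortest path need not be unique.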
A graph is connected if there exists a path (of any length) from every node to
every other node. The longest possible path between any two nodes in a
connected graph has length n-1, where n is the number of nodes in the graph.
A node is reachable from another node if there exists a path of any length
from one to the other.
A connected component is a maximal sub-graph in which all nodes are
reachable from every other. Maximal means that it is the largest possible sub-
graph: you could not find another node anywhere in the graph such that it could
be added to the sub-graph and all the nodes in the sub-graph would still be
connected.
For directed graphs, there are strong components and weak components. A
strong component is a maximal sub-graph in which there is a path from every
node to every node following all the arcs in the direction they are pointing. A
weak component is a maximal sub-graph which would be connected if we
ignored the direction of the arcs.
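The difference between strong and weak components can be sketched with simple reachability searches. This is a straightforward illustration rather than an efficient algorithm (linear-time methods such as Tarjan's exist); the function names and the sample arcs are assumptions of this example.

```python
def reachable(arcs_from, start):
    """Set of nodes reachable from start by following the given arcs."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in arcs_from.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def components(nodes, arcs, strong):
    """Strong components if strong is True, otherwise weak components."""
    forward, backward = {}, {}
    for a, b in arcs:
        forward.setdefault(a, []).append(b)
        backward.setdefault(b, []).append(a)
        if not strong:  # ignore direction: add every arc in both directions
            forward.setdefault(b, []).append(a)
            backward.setdefault(a, []).append(b)
    result, assigned = [], set()
    for node in nodes:
        if node not in assigned:
            # Nodes that `node` can reach and that can also reach `node`.
            comp = reachable(forward, node) & reachable(backward, node)
            result.append(comp)
            assigned |= comp
    return result


nodes = ["A", "B", "C", "D"]
arcs = [("A", "B"), ("B", "A"), ("B", "C")]
strong = components(nodes, arcs, strong=True)
weak = components(nodes, arcs, strong=False)
print(strong)
print(weak)
```

Here A and B reach each other, so they form one strong component, while C can be reached from B but cannot reach it back. Ignoring direction merges A, B, and C into a single weak component, and D remains on its own in both cases.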
A cutpoint is a vertex whose removal from the graph increases the number of
components. That is, it makes some points unreachable from some others. It
disconnects the graph.
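The definition of a cutpoint translates directly into a brute-force check: remove each node in turn and see whether the number of components increases. The sketch below assumes an undirected graph given as a node list and a list of ties; the function names are hypothetical, and faster depth-first-search methods exist.

```python
def count_components(nodes, ties):
    """Number of connected components among `nodes`, using only ties
    whose two endpoints are both in `nodes`."""
    neighbors = {node: set() for node in nodes}
    for a, b in ties:
        if a in neighbors and b in neighbors:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, count = set(), 0
    for node in nodes:
        if node not in seen:
            count += 1
            seen.add(node)
            stack = [node]
            while stack:
                current = stack.pop()
                for nxt in neighbors[current]:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return count


def cutpoints(nodes, ties):
    """Nodes whose removal increases the number of components."""
    base = count_components(nodes, ties)
    result = []
    for node in nodes:
        remaining = [other for other in nodes if other != node]
        if count_components(remaining, ties) > base:
            result.append(node)
    return result


nodes = ["A", "B", "C", "D"]
ties = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]
print(cutpoints(nodes, ties))  # ['B']
```

Removing B isolates A from the triangle B, C, D, so B is the only cutpoint in this example.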
A cutset is a collection of points whose removal increases the number of
components in a graph. A minimum weight cutset consists of the smallest set of
points that must be removed to disconnect a graph. The number of points in a
minimum weight cutset is called the point connectivity of a graph. If a graph has a
cutpoint, the connectivity of the graph is 1. The minimum number of points
separating two nonadjacent points s and t is also the maximum number of point-
disjoint paths between s and t.
A bridge is an edge whose removal from a graph increases the number of
components (disconnects the graph). An edge cutset is a collection of edges whose
removal disconnects a graph. A local bridge of degree k is an edge whose
removal causes the distance between the endpoints of the edge to be at least k.
The edge-connectivity of a graph is the minimum number of lines whose removal
would disconnect the graph. The minimum number of edges separating two
nonadjacent points s and t is also the maximum number of edge-disjoint paths
between s and t.
5. Social network’s characteristics
In the late 1950s, two mathematicians, Erdős and Rényi, made an
important contribution to graph theory by modeling many real-world networks with a
special type of graph, the random graph. To create a random graph with n nodes and
m ties, they put n nodes next to each other, take a pair of nodes at random and tie
them together, and continue the process until the graph has m ties. Erdős and Rényi
realized that “when m is small, the graph is likely to be fragmented into many
small clusters” (components), and that “as m increases the components grow”. For m >
n/2, a giant component emerges that connects a large fraction of the nodes [31].
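The construction described above can be simulated to observe how the largest component grows with m. This sketch makes two assumptions that the text does not state: a pair that is already tied is skipped and redrawn, and self-ties are not allowed. The function names and the `seed` parameter are hypothetical.

```python
import random


def random_graph(n, m, seed=None):
    """Tie random pairs of distinct nodes together until there are m ties."""
    if m > n * (n - 1) // 2:
        raise ValueError("m exceeds the number of possible ties")
    rng = random.Random(seed)
    ties = set()
    while len(ties) < m:
        a, b = rng.sample(range(n), 2)  # two distinct nodes
        ties.add((min(a, b), max(a, b)))  # duplicates are ignored by the set
    return ties


def largest_component(n, ties):
    """Size of the largest connected component, using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in ties:
        parent[find(a)] = find(b)
    sizes = {}
    for node in range(n):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


n = 1000
for m in (100, 400, 600, 1000, 4000):
    print(m, largest_component(n, random_graph(n, m, seed=1)))
```

Running the loop with different seeds gives a feel for how quickly the small clusters merge as m passes n/2; the exact sizes vary from run to run.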
Besides regular and random graphs, the two extreme types of graph, network
analysts also study other types of networks; the two most important of them
are small world and scale free networks.
5.1. Small world networks
The experiments conducted by Stanley Milgram and his colleagues on social
networks of people in the United States gave rise to the concept of the “small
world”. The phrase captures the initial surprise of two strangers (“What a small
world”) when they realize that they are indirectly connected to one another
through mutual friends. People in Kansas and Nebraska were asked to direct
letters to strangers in Boston by forwarding them to friends who they thought might
know the strangers in Boston. And half of