6S: Distributing crawling and searching across Web peers
6S: Distributing crawling and searching across Web peers
Filippo Menczer Ruj Akavipat Le-Shin Wu
School of Informatics Computer Science Dept. Computer Science Dept.
& Computer Science Dept. Indiana University Indiana University
Indiana University Bloomington, IN 47405 Bloomington, IN 47405
Bloomington, IN 47408
[email protected] [email protected]
ABSTRACT retrieval system, current search engines perform poorly for queries
A collaborative peer network application called 6Search (6S) is that require context based interpretation . Further, various bi-
proposed to address the scalability limitations of centralized search ases introduced to address the needs of the “average” user imply
engines. Each peer crawls the Web in a focused way, guided by the diminished effectiveness in satisfying many atypical search needs.
user’s information context. This way better (distributed) coverage Examples of bias include interfaces (advanced search features are
can be achieved. Each peer also acts as a search servent by submit- often buried and poorly documented), ranking (in favor of preci-
ting and responding to queries to/from its neighbors. This search sion and popularity, to please the majority of users who do not look
process has no centralized bottleneck. A local adaptive routing al- beyond the ﬁrst few hits), and coverage (well connected pages are
gorithm is introduced to dynamically change the topology of the easy for a crawler to ﬁnd and thus more likely to be indexed).
peer network based on a simple learning scheme driven by query We identify the above limitations as problems of scale: in spite
response interactions among neighbors. We validate a prototype of of enormous progress in crawling, indexing, retrieval and ranking,
the 6S network via simulations with 70 model users based on actual the “one engine ﬁts all” model does not — cannot — scale well
Web crawls. We ﬁnd that the network topology rapidly converges with the size, dynamics, and heterogeneity of the Web and its users.
from a random network to a small world network, with clusters Topical or vertical search engines are one approach to address
emerging from user communities with shared interests. We also this problem. Effective topical crawling algorithms have been de-
compare the quality of the results with those obtained by central- signed to support specialized portals [8, 22, 23, 7, 24]. In prior
ized search engines such as Google, suggesting that 6S can draw work one of the authors has investigated a tightly coupled system
advantages from the context and coverage of the peer collective. in which a topical crawler and search engine engage in a symbi-
otic relationship with the crawler feeding the search engine and the
search engine helping the crawler to better its performance. Such
Categories and Subject Descriptors symbiosis can allow a vertical search engine to learn about its com-
C.2.2 [Computer-Communication Networks]: Network Proto- munity’s interests and serve such a community with better focus
cols—Applications; C.2.4 [Computer-Communication Networks]: . Topical portals however remain prone to coverage and scala-
Distributed Systems; H.3.3 [Information Storage and Retrieval]: bility limitations, and maintain a very coarse grained level of per-
Information Search and Retrieval; H.3.4 [Information Storage and sonalization.
Retrieval]: Systems and Software—Distributed systems, informa- There is extensive work on personalization and information cus-
tion networks, performance evaluation (efﬁciency and effectiveness); tomization including applications to Web searching in the AI and
H.4.3 [Information Systems Applications]: Communication Ap- IR literature. (There are too many examples to list here; the reader
plications is referred to the review in  as one starting point.) However,
these effort are generally aimed to very limited information spaces
Keywords — digital libraries, Web sites, databases — and are not designed to
scale with searching the Web at large. An exception is represented
Peer collaborative search, distributed topical crawlers, scalability, by work aimed at making PageRank computations more scalable,
small world networks thus enabling personalized versions .
It is clear that distributed systems are part of the answer to the
1. INTRODUCTION AND BACKGROUND scale problem. Even “traditional” search engines employ distributed
Centralized search engines have difﬁculty with coverage of the and parallel systems to handle the massive computational and stor-
Web  because the Web is large, fast-growing , and dynamic age requirements of indexing, retrieval and ranking . Grub1 is a
[4, 12]. Even the popular press is quickly realizing that like any client-server distributed computing effort (a’la SETI@home )
to distribute the crawling task across many desktop computers in
the hope of achieving high coverage. While this effort is likely use-
ful in discovering unknown Web sites, the centralized integration
Permission to make digital or hard copies of all or part of this work for of crawl index data creates a huge bottleneck for searching.
personal or classroom use is granted without fee provided that copies are Peer network are increasingly seen as a candidate framework for
not made or distributed for proﬁt or commercial advantage and that copies
distributed Web search applications. One model proposed by the
bear this notice and the full citation on the ﬁrst page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior speciﬁc YouSearch project is to maintain a centralized search registry for
permission and/or a fee. query routing (like Napster), but enrich the peers with the capabil-
CIKM 2004 Washington, D.C. USA 1
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00. http://www.grub.org
ity to crawl and index local portions of the Web . The central
control in this approach unfortunately makes it difﬁcult to adapt
the search process to the heterogeneous and dynamic contexts of User
the peer users.
A completely decentralized approach is the Gnutella model, in
which queries are sent and forwarded blindly by each peer. The
Web Search Engine
problems of this approach are that peers ﬂooded by requests cannot
manage the ensuing trafﬁc, and that the topology is uncorrelated
with the interests of the peer users. As a result the basic Gnutella
model does not scale well with the number of users and queries. HTTP Peer Network
Adaptive, content based routing has been proposed to overcome
this difﬁculty in the ﬁle sharing setting. NeuroGrid  employs
a learning mechanism to adjust metadata describing the contents
of nodes. A similar idea has been proposed to distribute and per- Network
sonalize Web search using a query-based model and collaborative
ﬁltering . Search however is disjoint from crawling, making it
necessary to rely on centralized search engines for content.
An intermediate approach between the ﬂood network and the Figure 1: 6S protocol stack.
centralized registry is to store index lists in distributed, shared hash
tables . In pSearch  latent semantic analysis [11, 3] is per-
formed over such distributed hash tables to provide peers with key- 2. 6S PROTOCOL
word search capability. This is a promising approach, however Li
As shown in Figure 1, the 6S application (personal search en-
et al. argue that full-text Web search is infeasible in both the ﬂood
gine) layer sits between the user and the peer network layer. The
model and the distributed hash table model .
6S peer network protocol acts as an application layer between the
Another alternative are hybrid peer networks, where multiple
search engine and the network (TCP/IP) layer. The application also
special directory nodes (hubs) provide construct and use content
interfaces with the network using the HTTP protocol for crawling
models of neighboring nodes to determine how to route query mes-
sages through the network . In hybrid peer networks, leaf nodes
The 6S peer network layer provides the means to ﬁnd results
provide information and use content based retrieval to decide which
(hits) by querying the indexes built by peer search engines. When
documents to retrieve for queries.
the user submits a query to its personal search engine, the latter can
In this paper we propose an alternative model for peer-based
retrieve hits from its local index database and augment the results
Web search, which uses the same idea of content based models
by searching the peer network for additional hits.
of neighboring nodes but without assuming the presence of spe-
The design of our protocol is based on the following considera-
cial directory hubs. Each peer is both a (limited) directory hub and
a content provider; it has its own topical crawler (based on local
context), which supports a small local search engine. Queries are 1. Peers are independent.
ﬁrst matched against the local engine, and then routed to neighbor
peers to obtain more results. Initially the network has a random 2. A peer can enter and leave network at any time.
topology (like Gnutella) and queries are routed randomly as in the
ﬂood model. However, the protocol includes a learning algorithm 3. A peer should not be overwhelmed by other peers.
by which each peer uses the results of its interactions with its neigh- 4. A query should not be propagated indeﬁnitely.
bors (matches between queries and responses) to reﬁne a model of
the other peers. This model is used to dynamically route queries 5. A peer may choose not to respond or forward some queries.
according to the predicted match with other peers’ knowledge. The
network topology is thus modiﬁed on the ﬂy based on learned con- 6. The architecture should make it difﬁcult to create denial of
texts and current information needs. Similar ideas are receiving service (DoS) attacks using the service.
increasing attention in the P2P search literature [9, 39, 16].
Below we discuss the message primitives that the protocol uses
The key idea of the proposed peer search network is that the
for communication. The following section discusses the algorithm
ﬂooding problem can be alleviated by intelligent collaboration be-
and parameters of each message type.
tween the peers. This should lead to an emergent clustered topol-
ogy in which neighbor communities tend to form according to clus- 2.1 Message primitives
ters of peers with shared interests and domains. In fact we predict
We do not wish to design an overly complex protocol, which
that the ideal topology for such a network would be a “small world”
could hinder the development of improved protocols in the future.
. This topology allows for any two peers to reach each other
Following are the primitives that we feel one cannot do without.
via a short path (small diameter) while maximizing the efﬁciency
Here we use a simple XML syntax to illustrate peer network mes-
of communication within clustered peer communities. Following
sages. Our prototype protocol is based on these primitives. There
Milgram’s famous experiments on “six degrees of separation” ,
are a few additional primitives that we are considering for future
we named our model 6Search (6S).
implementations as they would enable richer peer interactions and
In the remainder of the paper we illustrate the 6S model in more
more sophisticated search and learning algorithms. Those are omit-
detail by focusing on its architecture and protocols. We also illus-
ted for brevity. It should also be noted that from the network layer
trate how it works by reporting on preliminary experiments using a
(TCP/IP) peers can identify each other during communication (from
simulated 6S network with 70 peers modeled after synthetic users.
their IP addresses, say), so this information is omitted in the peer
Finally we discuss the future of the project.
2.1.1 Query message
The query message is essential for a peer to be able to pass its
queries to other peers on the network. A query owner identiﬁcation
may optionally be attached to the query. We allow this option into
the primitive since a peer may need to identify itself to other peers.
Once a peer receives a query it will decide whether it should
2.1.2 Query response respond or not. If it decides to respond, it will match the query
against its local index database and return Nh results (hits) in the
: response message:
The query response is needed for a peer to be able to respond to
other peers’ search queries. As in the query message primitive, an
optional owner ID is provided so that the responder may identify
2.1.3 Proﬁle request
The proﬁle request is needed to let a peer request proﬁles from
others. The proﬁle describes what a peer has indexed and is ready
for sharing. We use the pull mechanism because it spares a peer Moreover, depending on the similarity between the query and
from the load of having to propagate updates for its own proﬁle. its neighbor proﬁles, the peer may also select some of its neigh-
The cost of a peer having to request for proﬁle information should bors and forward the query to them, as will be illustrated in Sec-
be lower than that of having to keep track of all peers that store tion 2.2.3. Then the peer will forward its neighbors’ responses, if
one’s proﬁle. any, back to the peer who originated the query.
The purpose of TTL is to limit the spreading of a query such that
2.1.4 Proﬁle response
a query will not survive in the network too long and move too far
from the originating peer. This is a standard technique to limit con-
gestion and loops in any network protocol. The TTL is decreased
for every forward and a query will not be forwarded when TTL
: reaches 0. In 6S we may allow for the amount by which the TTL is
decreased by a peer to depend on local variations of the protocol.
We impose a restriction on the way that a peer replies to a for-
warded query. The response must be sent to the neighbor that for-
warded the query, and not directly to the peer who originated the
This primitive allows a peer to respond to a request for its own query. This is because we want to prevent potential DoS attacks
proﬁle. Section 3.3 describes how a proﬁle is generated by a peer. created by exploiting the response system. If a peer responds to a
Such a proﬁle initially consists of a simple list of terms. Later, forwarded query by sending a reply directly to the source, someone
as peers learn about their neighbors’ expertise from the query re- can inject a query with a large TTL into the network together with
sponses they send, proﬁles are updated as described in Section 2.2.3. a spoofed return address. Then all the responses will be directed to
that address, overwhelming the target machine.
2.2 Message detail and protocol algorithms
Let us brieﬂy discuss the detail of each message primitive and 2.2.2 Neighbor management
algorithm in our protocol. Since our goal is to allow peers to form communities without
centralized control, a peer needs to ﬁnd new peers and evaluate
2.2.1 Query processing and response their quality and match. In our design, we choose not to have peers
A peer sends a query consisting of its query keywords and cor- aggressively ﬂooding the network looking for other peers unless it
responding weight of each keyword (weights can be 1 by default). is necessary to do so, such as when the peer enters the network for
Attached with each query are ID, TTL (time to live), and the ﬁrst time or when all known peers are not available. Otherwise,
timestamp. Owner identiﬁcation can be attached as a sign that a peer would discover new peers through its current neighbors. The
one wants to discover new neighbors. ID and timestamp are process that we use in our prototype is to let a peer attach its contact
added to help differentiate each query. The ID has to be unique ID with the query in the ownerid ﬁeld. If the peer that receives
only locally between two peers. a query wants to become a neighbor of the requesting peer, it will
Peer’s Neighbor’s User Interface Final Result
Peer Peer’s Neighbor Neighbor
(Query 1, Peer’s ID) Query
(Query 1, Peer’s ID) Combinator Module
Local Search Module Search Result
(Response, Neighbor’s Neighbor Search Neighbors
Neighbor’s id) Query Result
Neighbor Information Response Communication
Module Response From Module
Figure 2: 6S neighbor discovery. Neighbor
response with its own contact ID in the ownerid ﬁeld of the re-
sponse message. The new neighbor peers can later contact each Figure 3: 6S peer architecture.
other directly. The process is illustrated in Figure 2.
In addition to the mechanism for discovering new neighbors, we
also need to consider the issue of how often or when a peer will (c) New peers that respond to discovery signals are added
want to ﬁnd more neighbors. A simple approach is to give each to the list of known peers, with their proﬁle.
peer a ﬁxed number of slots for neighbors, Nn . This number can
vary among peers depending on their bandwidth and computational 3. For the next query (generated locally or forwarded), known
power to process neighbor data. We assume that Nn is ﬁxed for peers are ranked by similarity between the query and the peer
each peer. A peer will search for new peers when its neighbor slots descriptions. A randomized ranking function is used.
are not full or when it wants to ﬁnd better neighbors than the cur-
4. The top Nn ranked among known peers are selected as neigh-
rently known peers.
bors and sent the query.
Each peer may of course know about more than Nn peers. Let
us call Nk (t) the number of peers known at time t. This number 5. Goto step 2.
can grow arbitrarily, but probably will be capped at some parameter
determined by the peer application’s available memory or storage. In the above algorithm, a proﬁle is requested only when a peer
A peer must prune neighbor information as needed. is ﬁrst discovered. However, peers are not static; they change as
their index databases are updated following successive crawls. A
2.2.3 Adaptive query routing peer may even change its focus substantially as the interests of its
The actual set of Nn neighbors, i.e. those to whom queries are user base shift. This is the reason behind the randomized ranking
sent, is selected dynamically for each query at time t among the function of peers for routing queries; a peer will be occasionally
Nk (t) known peers. Sophisticated algorithms have been proposed rediscovered through this mechanism, and if it yields good results,
for determining the quality of peers . Here instead we propose its proﬁle can be requested again. This way we ensure probabilistic
a very simple adaptive routing algorithm to manage neighbor in- updates of peer proﬁles driven by query-based interactions.
formation and to use such information for dynamically selecting
neighbors to query: 3. 6S ARCHITECTURE
1. When a peer is ﬁrst discovered, a proﬁle is requested. A Using the protocol described in the previous section, we cre-
vector representation for the peer proﬁle is then initialized ate the architecture of a peer as shown in the data ﬂow of Fig-
with the list of keywords contained in the proﬁle response, ure 3. Each peer has a local search engine with its own index
each with unit weight. database. The peer not only processes its own local queries but
also the queries that are passed to it by other peers. The peer con-
2. Responses from neighbors (and neighbors’ neighbors) are tains seven basic modules, allowing us to easily test and modify
evaluated and used to update the proﬁle of each known peer. each component separately. The system is implemented in Java to
take advantage of a number of code libraries available from other
(a) The scores of the Nh hits received from each neighbor sources.
are compared with local hit scores. (We simply assume
that the scoring criteria are uniform across all peers.) 3.1 User Interface module
(b) If the score of any neighbor hit is better than at least This module is where the peer search system accepts queries
one of the top Nh local scores: from the user and displays the results back to the user. When the
user enters a query, this module distributes it to three other modules
i. the query keywords are added to the neighbor pro- (Combinator, Local Search, and Neighbor Information) where the
ﬁle vector with a weight corresponding to the score, query is further processed.
ii. a discovery signal (ownerid) is sent with the next
query to that neighbor.
Initialize all peers
3.2 Local Search module Initialize simulated network at random
This module handles the search task on a local index created While not terminated
from shared personal ﬁles, bookmarked pages, and pages crawled For each peer
by the local Web crawler. We use an open-source Web search en- If user submits a query
Process query on local search engine
gine, Nutch,2 as the local indexing and database code for this mod-
Send query to appropriate neighbors
For the topical crawler we use a best-N-ﬁrst search algorithm If a response to a previous query is received
developed in prior work, which has proven very effective against a If response is for local user
number of crawling algorithms in the literature [24, 27, 26, 28]. A Evaluate neighbor
detailed description of this crawling algorithm is outside the scope Combine hits with other hits
Output to user
of this paper and can be found in the above references. Brieﬂy,
the crawler is given a set of seed URLs to start from and a set of Forward response to query sender
topic keywords. The idea is that URLs to be visited are prioritized EndIf
by the similarity between the topic and the page in which a URL EndIf
is encountered. Some additional mechanisms guarantee that the If a query is received
crawler is sufﬁciently explorative. This crawler is also publicly Process query on local search engine
Send response back
If TTL > 0
The peer crawler can be seeded with pages bookmarked by the Decrease TTL
user, or hits returned by a search engine based on a user proﬁle, or Forward query to appropriate neighbors
pages visited recently by the user. The topic keywords, if not given EndIf
explicitly by the user, can be extracted from the user proﬁle or from EndIf
the queries submitted by the user to search engines during the day. If a request for profile is received
The results of a search are sent to different modules based on the
Send profile to requester
origin of the query. If the query comes from the user, the results are EndIf
sent ﬁrst to the Combinator module to be merged with hits obtained EndFor
from other peers, and they are also sent to the Neighbor Information EndWhile
Module to assist in evaluating neighbors’ responses to that query. If
the query comes from another peer the results will be sent back to
that peer by way of the Neighbor Information and Communication Figure 4: Simulator pseudocode.
3.3 Neighbor Information Module and the peer network layer. It is responsible for all communication
with other peers. The tasks of this module include passing queries,
This module handles how a peer responds to the others, which in-
results and other messages between the other modules of the local
cludes evaluating qualities of each neighbor and determining which
peer and the external peers.
known peers are best neighbors for sending or forwarding a par-
ticular query. The module contains a database that stores known
peer information, which is continually updated according to the al- 4. EXPERIMENTAL SETUP
gorithm in Section 2.2.3 each time that a response is received. The To analyze the behavior of 6S peer network interactions, we cre-
module also handles how much information is provided in response ated a simulator that allows us to model synthetic users and run
to neighbor requests for a peer proﬁle. their queries over real indexes obtained from actual distributed Web
As mentioned in the previous section, the evaluation of a new crawls. The goal of the simulator in the preliminary experiments
peer begins with a description received from that peer. Since Nutch outlined below is to study the statistics of 6S’s emergent peer net-
provides an interface for retrieving the highest frequency terms work topology and compare the distributed peer framework with its
from a search index, we use this as a simple way for a peer to centralized counterpart.
create its own proﬁle, to be sent to other peers upon request. This Our simulator takes a snapshot of the network for every time
is done by extracting the most frequent terms from the local index step. In a time step of the simulator, all of the peers process all of
database. their buffered incoming messages and send all of their buffered out-
When the Neighbor Information Module receives a query, whether going messages. This may include the generation of a local query
from a user or from other peers, it dynamically selects a set of as well as responding to the queries received by other peers and
neighbors from its database of known peers, based on the query. forwarding them. The pseudocode for the simulator is shown in
The Nn parameter can be set by the user to limit the maximum Figure 4.
number of neighbors to whom the module can forward queries. There are N = 70 peers in our simulation experiment. In order
to study whether the adaptive routing algorithm of the 6S network
3.4 Combinator module can generate network topologies that capture the interests shared
This module combines and re-ranks the hits obtained from the by user communities, thus reducing query ﬂooding problems, we
Local Search Module with those contained in responses received modeled synthetic users belonging to 7 different groups of 10 users
from peer neighbors. each. Each group is associated with a general topic. Each peer
has its own search engine database, but for the peers in a given
3.5 Communication module group the search engines are built by topical crawlers focusing on
This module acts as the interface between the peer application the same topic. For example if a group’s topic is “sports,” then
all the peer search engines in this group focus on different aspects
http://www.nutch.org of sports. Two points are to be emphasized here. First
- Related pdf books
- Reference for Chapter 11: Conic Sections
- Addressing the Intrinsic Disorder Bottleneck in Structural ...
- CHAPTER 5
- Designing and Developing an Information System - I450/I451 ...
- Kinetics at High Enzyme Concentration
- Intelligent peer networks for collaborative Web search
- state of turtle defined as x, y, α), coordinates ) and angle ...
- Search Engine-Crawler Symbiosis: Adapting to Community Interests
- Simple Permutations Mix Well - Informatics: School of ...
- I400 Introduction to Bioinformatics: Genes and Blue Genes
- IMPROVING SEQUENCE ALIGNMENTS
- On the Lack of Typical Behavior in the Global Web Trafﬁc ...
- Lecture Notes for I500 Fall 2009 Fundamental Computer ...
- GiveALink: Mining a Semantic Network of Bookmarks for Web Search ...
- Introduction to Probabilistic Modeling - Informatics: School ...
- IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND ...
- (Im)possibility of Safe Exchange Mechanism Design