Consider a web crawler, i.e., an application that recursively follows links in web pages, compiling a list of pages as it goes. Such an application is extremely communications-bound: most of its time is spent downloading web pages from the network, so it can take a long time, even on a fast processor, if network bandwidth is limited. This leads to an interesting idea: what if we could use volunteer computing to get other machines, with their own (possibly faster) network connections, to do the crawling for us?
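At its core, such a crawler is a breadth-first traversal of the web's link graph, where visiting a node means downloading a page. The sketch below is our own illustration, not code from the actual application; helper names like `fetchPage` and `extractLinks` are hypothetical. Note that essentially all of the running time goes into `fetchPage`, which is why a faster network link matters more than a faster processor.

```java
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

public class CrawlerSketch {
    // Collect up to maxUrls URLs by breadth-first traversal of links.
    static List<String> crawl(String startUrl, int maxUrls) {
        Set<String> found = new LinkedHashSet<>();    // URLs seen so far
        Deque<String> frontier = new ArrayDeque<>();  // URLs still to visit
        frontier.add(startUrl);
        while (!frontier.isEmpty() && found.size() < maxUrls) {
            String url = frontier.poll();
            if (!found.add(url)) continue;            // already visited
            try {
                String page = fetchPage(url);         // the network-bound step
                frontier.addAll(extractLinks(page, url));
            } catch (IOException e) { /* skip unreachable pages */ }
        }
        return new ArrayList<>(found);
    }

    // Download one page; on a slow link this dominates the total run time.
    static String fetchPage(String url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            for (String line; (line = in.readLine()) != null; )
                sb.append(line).append('\n');
        }
        return sb.toString();
    }

    private static final Pattern HREF =
        Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Resolve each href attribute against the page's URL to get new links.
    static List<String> extractLinks(String html, String baseUrl) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try { links.add(new URL(new URL(baseUrl), m.group(1)).toString()); }
            catch (MalformedURLException e) { /* ignore malformed links */ }
        }
        return links;
    }
}
```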
To test this idea, we implemented a distributed web-crawling application using the Bayanihan framework and used it in the setup shown in Fig. 2. We simulated a fast computer with a slow network link by placing the user on a 166 MHz Pentium PC connected to the network through a 28.8 kbps modem. On this machine, we ran a Bayanihan server, as well as a watcher client so that the user could see the results.
Then, we measured the time it took to find 4000 URLs starting from http://www.yahoo.com/, first running the worker applet on the same PC (using appletviewer with network restrictions turned off), and then running it on ``volunteer'' UNIX workstations with slower Java interpreters (the factoring demo runs about 4 times slower on these machines) but faster network links (10 Mbps Ethernet).
Table 1 shows the results of the experiment using two versions of the application. The first version was based on the same ``basic'' work engine and work manager objects used in the factoring demo. Although this version produced the desired speedups, we noticed that its full potential was suppressed by a communications bottleneck: the workers were able to download web pages very quickly through their Ethernet links, but they were slowed down considerably by the need to report the results of each work packet back to the server (on the other side of the modem link) before starting on the next packet.
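To make the bottleneck concrete, the basic worker's loop is strictly sequential, roughly as in the sketch below. The interface and class names (`WorkServer`, `WorkPacket`, `Result`) are hypothetical stand-ins for the actual Bayanihan objects, not the framework's API.

```java
import java.util.List;

// Hypothetical stand-ins for the server proxy and data objects.
interface WorkServer {
    WorkPacket getWork();          // request a packet across the modem link
    void reportResult(Result r);   // return a result across the modem link
}
class WorkPacket { List<String> urlsToCrawl; }
class Result { List<String> urlsFound; }

class BasicWorker {
    // Each result must cross the slow server link before the next packet is
    // even requested, so the worker's fast Ethernet link idles in between.
    static void run(WorkServer server) {
        while (true) {
            WorkPacket packet = server.getWork();  // blocks on the modem link
            Result result = doWork(packet);        // fast: crawls via Ethernet
            server.reportResult(result);           // blocks on the modem link again
        }
    }
    static Result doWork(WorkPacket p) {
        // download each URL in the packet and extract new links
        return new Result();
    }
}
```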
To remove this bottleneck, we extended the basic work engine and work manager objects to allow the worker to prefetch several packets of work, so that it can do new work while simultaneously reporting the results of previously completed work. As Table 1 shows, this resulted in up to roughly 80% more speedup over the basic version. Furthermore, the new ``multi-work'' versions of the work engine and work manager can be used transparently in place of the basic versions, allowing us to improve the performance of other applications as well without any recoding.
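One plausible way to realize such prefetching, continuing the hypothetical interfaces from the previous sketch (the actual multi-work engine's internals may differ), is to decouple communication from computation with bounded queues:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MultiWorker {
    static void run(WorkServer server, int prefetch) {
        BlockingQueue<WorkPacket> pending = new ArrayBlockingQueue<>(prefetch);
        BlockingQueue<Result> outbox = new LinkedBlockingQueue<>();

        // Fetcher thread: keeps up to `prefetch` packets buffered locally,
        // so the worker rarely waits on the modem link for new work.
        new Thread(() -> {
            try { while (true) pending.put(server.getWork()); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        // Reporter thread: ships finished results back to the server while
        // the worker is already busy with the next packet.
        new Thread(() -> {
            try { while (true) server.reportResult(outbox.take()); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        // Main loop: computation now overlaps both the prefetching of new
        // work and the reporting of old results, hiding modem round trips.
        try {
            while (true) outbox.put(BasicWorker.doWork(pending.take()));
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Even a small prefetch depth suffices to keep the worker continuously busy, which is consistent with the improvement visible in Table 1.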
Table 1. Results of the web-crawling experiment. The numbered columns give results for 1-3 UNIX-Ethernet workers under the Basic and Multi-Work versions; PC-mdm is the 166 MHz PC behind the 28.8 kbps modem.

| measurement | PC-mdm | Basic, 1 | Basic, 2 | Basic, 3 | Multi-Work, 1 | Multi-Work, 2 | Multi-Work, 3 |
|---|---|---|---|---|---|---|---|
| time for 4000 URLs (s) | 817 | 391 | 268 | 213 | 222 | 148 | 166 |
| speed (URLs/s) | 4.89 | 10.2 | 14.9 | 18.8 | 18.0 | 27.0 | 24.1 |
| speedup (rel. to PC-mdm) | 1 | 2.09 | 3.05 | 3.84 | 3.68 | 5.52 | 4.93 |
| speedup (rel. to 1 worker, same version) | | 1 | 1.46 | 1.84 | 1 | 1.50 | 1.34 |
| speedup (rel. to Basic, same no. of workers) | | | | | 1.76 | 1.81 | 1.28 |