Consider a web crawler, i.e., an application that recursively follows links in web pages, compiling a list of pages as it goes. Such an application is extremely communications-bound: most of its time is spent downloading web pages from the network, so it can take a long time, even on a fast processor, if network bandwidth is limited. This leads to an interesting idea: what if we could use volunteer computing to get other machines, with their own (possibly faster) network connections, to do the crawling for us?
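At its core, such a crawler is a breadth-first traversal of the web's link graph, where visiting a node means downloading a page. The sketch below is our own illustration, not code from the actual application; helper names like `fetchPage` and `extractLinks` are hypothetical. Note that essentially all of the running time goes into `fetchPage`, which is why a faster network link matters more than a faster processor.

```java
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

public class CrawlerSketch {
    // Collect up to maxUrls URLs by breadth-first traversal of links.
    static List<String> crawl(String startUrl, int maxUrls) {
        Set<String> found = new LinkedHashSet<>();    // URLs seen so far
        Deque<String> frontier = new ArrayDeque<>();  // URLs still to visit
        frontier.add(startUrl);
        while (!frontier.isEmpty() && found.size() < maxUrls) {
            String url = frontier.poll();
            if (!found.add(url)) continue;            // already visited
            try {
                String page = fetchPage(url);         // the network-bound step
                frontier.addAll(extractLinks(page, url));
            } catch (IOException e) { /* skip unreachable pages */ }
        }
        return new ArrayList<>(found);
    }

    // Download one page; on a slow link this dominates the total run time.
    static String fetchPage(String url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            for (String line; (line = in.readLine()) != null; )
                sb.append(line).append('\n');
        }
        return sb.toString();
    }

    private static final Pattern HREF =
        Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Resolve each href attribute against the page's URL to get new links.
    static List<String> extractLinks(String html, String baseUrl) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try { links.add(new URL(new URL(baseUrl), m.group(1)).toString()); }
            catch (MalformedURLException e) { /* ignore malformed links */ }
        }
        return links;
    }
}
```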
To test this idea, we implemented a distributed web-crawling application using the Bayanihan framework and used it in the setup shown in Fig. 2. We simulated a fast computer with a slow network link by placing the user on a 166 MHz Pentium PC connected to the network through a 28.8 kbps modem. On this machine, we ran a Bayanihan server, as well as a watcher client so that the user could see the results.
Then, we measured the time it took to find 4000 URLs starting from http://www.yahoo.com/, first running the worker applet on the same PC (using appletviewer with network restrictions turned off), and then running it on ``volunteer'' UNIX workstations with slower Java interpreters (the factoring demo runs about 4 times slower on these machines) but faster network links (10 Mbps Ethernet).
Table 1 shows the results of the experiment using two versions of the application. The first version was based on the same ``basic'' work engine and work manager objects used in the factoring demo. Although this version produced the desired speedups, we noticed that its full potential was suppressed by a communications bottleneck: the workers were able to download web pages very quickly through their Ethernet links, but they were slowed down considerably by the need to report the results of each work packet back to the server (on the other side of the modem link) before starting on the next packet.
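To make the bottleneck concrete, the basic worker's loop is strictly sequential, roughly as in the sketch below. The interface and class names (`WorkServer`, `WorkPacket`, `Result`) are hypothetical stand-ins for the actual Bayanihan objects, not the framework's API.

```java
import java.util.List;

// Hypothetical stand-ins for the server proxy and data objects.
interface WorkServer {
    WorkPacket getWork();          // request a packet across the modem link
    void reportResult(Result r);   // return a result across the modem link
}
class WorkPacket { List<String> urlsToCrawl; }
class Result { List<String> urlsFound; }

class BasicWorker {
    // Each result must cross the slow server link before the next packet is
    // even requested, so the worker's fast Ethernet link idles in between.
    static void run(WorkServer server) {
        while (true) {
            WorkPacket packet = server.getWork();  // blocks on the modem link
            Result result = doWork(packet);        // fast: crawls via Ethernet
            server.reportResult(result);           // blocks on the modem link again
        }
    }
    static Result doWork(WorkPacket p) {
        // download each URL in the packet and extract new links
        return new Result();
    }
}
```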
To remove this bottleneck, we extended the basic work engine and work manager objects to allow the worker to prefetch several packets of work, so that it can do new work while simultaneously reporting the results of previously completed work. As Table 1 shows, this resulted in up to roughly 80% more speedup over the basic version. Furthermore, the new ``multi-work'' versions of the work engine and work manager can be used transparently in place of the basic versions, allowing us to improve the performance of other applications as well without any recoding.
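One plausible way to realize such prefetching, continuing the hypothetical interfaces from the previous sketch (the actual multi-work engine's internals may differ), is to decouple communication from computation with bounded queues:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MultiWorker {
    static void run(WorkServer server, int prefetch) {
        BlockingQueue<WorkPacket> pending = new ArrayBlockingQueue<>(prefetch);
        BlockingQueue<Result> outbox = new LinkedBlockingQueue<>();

        // Fetcher thread: keeps up to `prefetch` packets buffered locally,
        // so the worker rarely waits on the modem link for new work.
        new Thread(() -> {
            try { while (true) pending.put(server.getWork()); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        // Reporter thread: ships finished results back to the server while
        // the worker is already busy with the next packet.
        new Thread(() -> {
            try { while (true) server.reportResult(outbox.take()); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();

        // Main loop: computation now overlaps both the prefetching of new
        // work and the reporting of old results, hiding modem round trips.
        try {
            while (true) outbox.put(BasicWorker.doWork(pending.take()));
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Even a small prefetch depth suffices to keep the worker continuously busy, which is consistent with the improvement visible in Table 1.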
Table 1. Results of the web-crawling experiment. The numbered columns give results for 1-3 UNIX-Ethernet workers under the Basic and Multi-Work versions; PC-mdm is the 166 MHz PC behind the 28.8 kbps modem.

| measurement | PC-mdm | Basic, 1 | Basic, 2 | Basic, 3 | Multi-Work, 1 | Multi-Work, 2 | Multi-Work, 3 |
|---|---|---|---|---|---|---|---|
| time for 4000 URLs (s) | 817 | 391 | 268 | 213 | 222 | 148 | 166 |
| speed (URLs/s) | 4.89 | 10.2 | 14.9 | 18.8 | 18.0 | 27.0 | 24.1 |
| speedup (rel. to PC-mdm) | 1 | 2.09 | 3.05 | 3.84 | 3.68 | 5.52 | 4.93 |
| speedup (rel. to 1 worker, same version) | | 1 | 1.46 | 1.84 | 1 | 1.50 | 1.34 |
| speedup (rel. to Basic, same no. of workers) | | | | | 1.76 | 1.81 | 1.28 |