
Timing Measurements.

In our experiment, a Bayanihan server was run on a Pentium 166 MHz PC connected to the MIT LCS network via a 28.8 Kbps modem (with compression). We then used worker clients running on the PC itself or on other machines. The web crawler started at http://www.yahoo.com/ and ran until it found 3000 new URLs. A watcher client on the PC allowed the user to view the results.

Table 3 shows some preliminary results. The five scenarios measured were: a worker on the PC via the modem (PC-mdm), a worker on the PC via Ethernet (PC-net), and workers on one, two, and three SparcStations via Ethernet (1S, 2S, 3S). The times in Table 3 are the differences between the times for finding 500 and 2500 URLs; this masks irregularities in startup times. Equivalent speeds, as well as speedups relative to the PC-modem and one-SparcStation cases, are also shown.

  
Table 3: Timing Measurements for the distributed web-crawler.

measurement               PC-mdm   1S     2S     3S
time for 2000 URLs (s)    223      149    92.2   56.1
speed (URL/s)             8.97     13.4   21.7   35.7
speedup (vs. PC-mdm)      1        1.49   2.42   3.98
speedup (vs. 1S)          --       1      1.62   2.66
efficiency (vs. 1S)       --       100%   81%    88.7%
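The derived rows follow directly from the measured times; for the two-SparcStation case, for example:

\[
\text{speed} = \frac{2000\ \text{URLs}}{92.2\ \text{s}} \approx 21.7\ \text{URL/s}, \qquad
\text{speedup}_{1S} = \frac{149}{92.2} \approx 1.62, \qquad
\text{efficiency} = \frac{1.62}{2} \approx 81\%.
\]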


As expected, the data shows that the user gets a speedup by using more workers. More interestingly, however, it also shows how the PC on the slow network link is able to take advantage of workers on faster links to effectively increase its communications capabilities. (Note to reviewers: the question may arise as to whether the performance improvement is due not to improved communications but to increased computational power. Benchmarks using the factoring application, and timing statistics from individual work packets in the crawler application, show that the SparcStations' Java interpreter is actually slower, and that communications is indeed the bottleneck. However, we admit that it would be more convincing to run this experiment using the same kind of machine on all sides, so we plan to repeat it before the final version is due.)

Note that the speedups achieved here are actually far from ideal, and can potentially be much higher. Although the SparcStation workers connected via Ethernet can download web pages very quickly, they are still slowed down considerably by the need to report results back to the server (on the other side of the modem link) for each web page checked. A possible solution, which we are planning to try, is to extend BasicWorkEngine with a version that caches new results and new work to minimize communication with the server, as sketched below.
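The following is a minimal sketch of this caching idea in Java. It assumes the Bayanihan framework's BasicWorkEngine class; the method names used here (reportResult as the per-result hook and sendResultsToServer as a batch-send primitive) are hypothetical names for illustration and may differ from the actual Bayanihan API.

    import java.util.Vector;

    public class CachingWorkEngine extends BasicWorkEngine {
        private Vector resultCache = new Vector();
        private int batchSize;

        public CachingWorkEngine(int batchSize) {
            this.batchSize = batchSize;
        }

        // Buffer each result locally instead of crossing the slow
        // modem link once per web page checked.
        public void reportResult(Object result) {
            resultCache.addElement(result);
            if (resultCache.size() >= batchSize) {
                flush();
            }
        }

        // Send all cached results to the server in a single message,
        // amortizing the round-trip cost over batchSize pages.
        private void flush() {
            sendResultsToServer(resultCache);  // assumed batch primitive
            resultCache.removeAllElements();
        }
    }

Batching in this way trades latency for bandwidth: individual results reach the server later, but the modem link carries one message per batch of pages instead of one message per page.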

