MASSACHVSETTS INSTITVTE OF TECHNOLOGY
Department of Electrical Engineering and Computer Science
6.001--Structure and Interpretation of Computer Programs
Spring Semester, 1998

Problem Set 6

Searching the World Wide Web

This problem set explores some issues that arise in constructing a ``spider'' or an ``autonomous agent'' that crawls over the documents in the World Wide Web. For our purposes, the Web is a very large (and growing!) collection of documents. Each document contains some text and also links to other documents, in the form of URLs.
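Throughout the problem set it is convenient to think of a document as an abstract object with three parts: its URL, its text, and its list of outgoing links. Here is a minimal sketch of such an abstraction in Scheme; the constructor and selector names are only illustrative, not necessarily the ones used in the problem-set code itself.

(define (make-document url text links)
  (list url text links))
(define (document-url doc) (car doc))       ; the document's own URL
(define (document-text doc) (cadr doc))     ; a list of the words in the document
(define (document-links doc) (caddr doc))   ; a list of URLs the document points to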

In this problem set we'll be working with programs that can start with an initial document and follow the references to other documents to do useful things. For example, we could construct an index of all the words occurring in documents, and make this available to people looking for information on the web (as in Digital Equipment Corporation's AltaVista system at http://AltaVista.digital.com).

Just in case you aren't fluent with the details of HTTP, URLs, URIs, HTML, XML, XSL, HTTP-NG, DOM, and the rest of the alphabet soup that makes up the technical details of the Web, here's a simplified version of what goes on behind the scenes:

  1. The Web consists of a very large number of things called documents, identified by names called URLs. For example, the 6.001 home page has the URL http://mit.edu/6.001. The URL contains the name of a protocol (HTTP in this case) that can be used to fetch the document, as well as the information needed by the protocol to specify which document is intended (mit.edu/6.001 in this case).
  2. By using the HTTP protocol, a program can retrieve a document whose URL starts with HTTP:. Most commonly that program is a browser, but any program can do this; ``autonomous agents'' and spiders are examples of such programs that aren't browsers. The document is returned to the program, along with information about how it is encoded: for example, ASCII or Unicode text, HTML, images in GIF, JPG, PNG, or some other format, MPEG video, an Excel or Lotus spreadsheet, etc.
  3. Documents encoded in HTML form can contain a mixture of text, images, formatting information, and links to other documents. Thus, when a browser (or other program) gets an HTML document, it can extract the links from it, yielding URLs for other documents in the Web. If those documents are also in HTML format, they can be retrieved in turn and will yield yet more links, and so on. (A rough sketch of link extraction appears after this list.)
  4. A spider is a program that starts with an initial set of URLs, retrieves the corresponding documents, adds the links found in those documents to the set of URLs, and keeps going. Every time it retrieves a document, it does some (hopefully useful) work in addition to just finding the embedded links. (See the spider-loop sketch after this list.)
  5. One particularly interesting kind of spider constructs an index of the documents it has seen. This index is similar to the index at the end of a book: it has certain key words and phrases, and for each entry it lists all of the URLs of documents that contain that word or phrase. There are many kinds of indexes, and the art/science of deciding which words or phrases to index and how to extract them is at the cutting edge of research (it is part of the discipline called information retrieval). We'll talk about full-text indexing, which means that every word in the document (except, perhaps, very common words like ``and,'' ``the,'' ``a,'' and ``an'') is indexed. (A sketch of a simple full-text index also appears after this list.)
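To make item 3 concrete, here is a rough sketch of link extraction, assuming the raw HTML arrives as a Scheme string and that every link appears literally as href="..." (real HTML is messier than this). It uses string-search-forward, an MIT Scheme procedure that returns the index of the first occurrence of a pattern in a string, or #f if there is none.

(define (extract-links html)
  ;; Scan the HTML string left to right, collecting everything that
  ;; appears between href=" and the next closing quote.
  (define (scan start)
    (let ((hit (string-search-forward "href=\"" html start)))
      (if (not hit)
          '()
          (let* ((url-start (+ hit 6))    ; skip past the six characters href="
                 (url-end (string-search-forward "\"" html url-start)))
            (cons (substring html url-start url-end)
                  (scan (+ url-end 1)))))))
  (scan 0))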
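The spider of item 4 can be written as a simple loop over a list of URLs still to be visited. This is only a sketch: fetch-document stands in for whatever procedure actually retrieves a URL and builds a document object (as in the abstraction sketched earlier), and process is the per-document work, such as indexing.

(define (spider to-visit visited process)
  (cond ((null? to-visit) 'done)
        ((member (car to-visit) visited)          ; already seen this URL; skip it
         (spider (cdr to-visit) visited process))
        (else
         (let ((doc (fetch-document (car to-visit))))
           (process doc)
           ;; Newly discovered links go at the end of the list,
           ;; so the search proceeds breadth first.
           (spider (append (cdr to-visit) (document-links doc))
                   (cons (car to-visit) visited)
                   process)))))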
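Finally, for item 5, one simple way to represent a full-text index is an association list that maps each word to the list of URLs whose documents contain it. The sketch below assumes the document selectors from the earlier sketch and ignores the question of filtering out very common words; the spider and the index fit together by passing a procedure that calls index-document as the process argument to spider.

;; Add every word of doc to the index, and return the updated index.
(define (index-document doc index)
  (define (add-word index word)
    (let ((entry (assoc word index)))
      (cond ((not entry)
             (cons (list word (document-url doc)) index))
            ((member (document-url doc) (cdr entry))
             index)                                  ; this URL is already recorded
            (else
             (set-cdr! entry (cons (document-url doc) (cdr entry)))
             index))))
  (fold-left add-word index (document-text doc)))

;; Look up a word: returns the list of URLs indexed under it.
(define (find-documents word index)
  (let ((entry (assoc word index)))
    (if entry (cdr entry) '())))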




Hal Abelson
Thu Feb 26 17:53:59 EST 1998