MASSACHVSETTS INSTITVTE OF TECHNOLOGY
Department of Electrical Engineering and Computer Science
6.001--Structure and Interpretation of Computer Programs
Spring Semester, 1998
Problem Set 6
- Issued: Tuesday, March 10
- Tutorial preparation for: Week of March 16
- Written solutions due: Friday, March 20 in recitation
- Reading: Read
sections 3.1 and 3.3.1 before lecture on March 12. Read section 3.2
before lecture on March 17.
Searching the World Wide Web
This problem set explores some issues that arise in constructing a
``spider'' or an ``autonomous agent'' that crawls over the documents
in the World Wide Web. For our purposes, the Web is a very large (and
growing!) collection of documents. Each document contains some text
and also links to other documents, in the form of URLs.
In this problem set we'll be working with programs that can start with
an initial document and follow the references to other documents to do
useful things. For example, we could
construct an index of all the words occurring in documents, and make
this available to people looking for information on the Web (as in
Digital Equipment Corporation's AltaVista system at
http://AltaVista.digital.com).
Just in case you aren't fluent with the details of HTTP, URLs, URIs,
HTML, XML, XSL, HTTP-NG, DOM, and the rest of the alphabet soup that
makes up the technical details of the Web, here's a simplified version
of what goes on behind the scenes:
- The Web consists of a very large number of things called documents,
identified by names called URLs. For example, the 6.001 home page has
the URL http://mit.edu/6.001. The URL contains the name of a
protocol (HTTP in this case) that can be used to fetch the document,
as well as the information needed by the protocol to specify which
document is intended (mit.edu/6.001 in this case).
- By using the HTTP protocol, a program can retrieve any document whose
URL starts with http: (most commonly that program is a browser, but any
program can do this; ``autonomous agents'' and spiders are examples of
such programs that aren't browsers). The document is returned to the
program, along with information about how it is encoded: for example,
ASCII or Unicode text, HTML, images in GIF or JPG or MPEG or PNG or
some other format, an Excel or Lotus spreadsheet, etc.
- Documents encoded in HTML form can contain a mixture of text, images,
formatting information, and links to other documents. Thus, when a
browser (or other program) gets an HTML document, it can extract the
links from it, yielding URLs for other documents in the Web. If these
are in HTML format, then they can be retrieved and will yield yet
more links, and so on.
- A spider is a program that starts with an initial set of URLs,
retrieves the corresponding documents, adds the links from these
documents to the set of URLs, and keeps on going. Every time it
retrieves a document, it does some (hopefully useful) work in
addition to just finding the embedded links. (A sketch of such a
crawling loop in Scheme appears after this list.)
- One particularly interesting kind of spider constructs an
index of the documents it has seen. This index is similar to
the index at the end of a book: it has certain key words and phrases,
and for each entry it lists all of the URLs that contain that word or
phrase. There are many kinds of indexes, and the art/science of
deciding what words or phrases to index and how to extract them is at
the cutting edge of research (it's part of the discipline called
information retrieval). We'll talk about full text
indexing, which means that every word in the document (except,
perhaps, the very common words like ``and,'' ``the,'' ``a,'' and
``an'') is indexed. (A sketch of such an index also appears after
this list.)
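
To make the spider's crawling loop concrete, here is a minimal sketch
in Scheme. It is only an illustration under some assumptions: the
procedure get-document-links, which is supposed to return the list of
URLs that the document named by a given URL links to, is a hypothetical
stand-in for whatever link-extraction facility is actually available
(it is not part of standard Scheme, and not necessarily what this
problem set provides), and process is whatever per-document work the
caller wants done (such as indexing).

  (define (crawl initial-urls process get-document-links)
    ;; TO-VISIT is the list of URLs still to be retrieved; VISITED is
    ;; the list of URLs already processed, so no URL is handled twice.
    (define (loop to-visit visited)
      (cond ((null? to-visit) visited)
            ((member (car to-visit) visited)           ; already seen it
             (loop (cdr to-visit) visited))
            (else
             (let ((url (car to-visit)))
               (process url)                           ; the useful work
               (loop (append (cdr to-visit)
                             (get-document-links url)) ; add its links
                     (cons url visited))))))
    (loop initial-urls '()))

Appending the newly discovered links at the end of to-visit makes the
traversal breadth first; consing them onto the front would make it
depth first instead. On the real Web such a loop would of course have
to be stopped explicitly, since the supply of reachable documents is
effectively unbounded.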
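
In the same spirit, a full-text index can be modeled as a table that
maps each word to the list of URLs whose documents contain it. The
sketch below is self-contained, using a plain association list rather
than a more efficient table (such as the tables of section 3.3.3); the
procedure names are invented for illustration, and no attempt is made
to skip the very common words mentioned above.

  ;; An index is an association list of entries of the form
  ;; (word url1 url2 ...), where the words are symbols.

  (define (make-index) '())

  ;; Record that WORD occurs in the document named URL, returning the
  ;; (possibly new) index.
  (define (index-word index word url)
    (let ((entry (assq word index)))
      (cond ((not entry)
             (cons (list word url) index))           ; first occurrence of WORD
            ((member url (cdr entry))
             index)                                  ; URL already recorded
            (else
             (set-cdr! entry (cons url (cdr entry))) ; add URL to the entry
             index))))

  ;; Add every word of WORDS (a list of symbols) to the index for URL.
  (define (index-document index url words)
    (if (null? words)
        index
        (index-document (index-word index (car words) url)
                        url
                        (cdr words))))

  ;; Look up a word: returns the list of URLs whose documents contain it.
  (define (find-documents index word)
    (let ((entry (assq word index)))
      (if entry (cdr entry) '())))

  ;; Example (the URL and words are made up for illustration):
  ;; (define idx (index-document (make-index)
  ;;                             "http://mit.edu/6.001"
  ;;                             '(structure and interpretation)))
  ;; (find-documents idx 'structure)    ; => ("http://mit.edu/6.001")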
Hal Abelson
Thu Feb 26 17:53:59 EST 1998