The corpus used in this work (Introduction to Algorithms by Cormen, Leiserson, Rivest & Stein) is not in the public domain. As an alternative, we provide below a dataset in XML format containing the word and noun-phrase statistics for each section of the corpus.
Also available for download are the original source code and a version of the code modified to work with the alternative dataset given here.
[ Original Source Code ]
You can also download all of the above bundled into one archive:
[ Complete Archive ]
The code uses version 2.7.0 of the Xerces C++ XML library. Binary and source distributions of the library can be downloaded from the Xerces project home page.
All of the code except the baselines is written in C++. Our compilation environment was gcc 3.3.5 on Linux, but the code should compile and run in any environment with a newer version of gcc.
The baselines are written in Python.
All of the archives contain a readme file with useful information. The source archives include the necessary makefile in GNU make format.
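For readers who want to inspect the XML dataset without building the C++ code, the following is a minimal Python sketch of how per-section word and noun-phrase counts might be loaded. The element and attribute names (`corpus`, `section`, `word`, `nounphrase`, `id`, `text`, `count`) and the sample values are assumptions made for illustration only; the actual schema is described in the readme file shipped with the archives and may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the element names, attribute names, and counts
# below are illustrative assumptions, not the distributed dataset's schema.
SAMPLE = """
<corpus>
  <section id="2.1">
    <word text="sort" count="14"/>
    <word text="array" count="9"/>
    <nounphrase text="insertion sort" count="6"/>
  </section>
  <section id="2.2">
    <word text="sort" count="3"/>
    <nounphrase text="running time" count="11"/>
  </section>
</corpus>
"""


def load_section_stats(xml_text):
    """Map each section id to its word and noun-phrase counters."""
    root = ET.fromstring(xml_text)
    stats = {}
    for section in root.iter("section"):
        words = Counter()
        noun_phrases = Counter()
        # Accumulate counts so repeated entries in a section are summed.
        for w in section.findall("word"):
            words[w.get("text")] += int(w.get("count"))
        for np in section.findall("nounphrase"):
            noun_phrases[np.get("text")] += int(np.get("count"))
        stats[section.get("id")] = {"words": words, "noun_phrases": noun_phrases}
    return stats


def corpus_word_totals(stats):
    """Sum word counts over all sections."""
    total = Counter()
    for per_section in stats.values():
        total.update(per_section["words"])
    return total


if __name__ == "__main__":
    stats = load_section_stats(SAMPLE)
    for section_id, counts in stats.items():
        print(section_id, dict(counts["words"]), dict(counts["noun_phrases"]))
    print(dict(corpus_word_totals(stats)))
```

To use this with the real dataset, the tag and attribute names would need to be adjusted to match the file, and `ET.parse(path).getroot()` could replace `ET.fromstring` for reading from disk.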