Generating a Table-of-Contents

S.R.K. Branavan, Pawan Deshpande, Regina Barzilay

Abstract

This paper presents a method for the automatic generation of a table-of-contents. This type of summary could serve as an effective navigation tool for accessing information in long texts, such as books. To generate a coherent table-of-contents, we need to capture both global dependencies across different titles in the table and local constraints within sections. Our algorithm effectively handles these complex dependencies by factoring the model into local and global components, and incrementally constructing the model's output. The results of automatic evaluation and manual assessment confirm the benefits of this design: our system is consistently ranked higher than non-hierarchical baselines.

Code & Data

The corpus used in this work (Introduction to Algorithms by Cormen, Leiserson, Rivest & Stein) is not available in the public domain. We therefore provide an alternative below: a dataset in XML format containing the word and noun-phrase statistics for each section of the corpus.

Also available for download are the original source code and a version of the code modified to work with the alternative corpus provided here.

       [ Original Source Code ]
       [ Modified Source Code ]
       [ Baselines ]
       [ Corpus ]

You can also download all of the above bundled into one archive:

       [ Complete Archive ]

The code uses version 2.7.0 of the Xerces C++ XML library. Binary and source distributions of the library are available from the Xerces project home page.

All the code except the baselines is written in C++. Our compilation environment was gcc 3.3.5 on Linux, but the code should compile and run in any environment with a newer version of gcc.
The baselines are written in Python.

All of the archives contain a readme file with useful information. The source archives include the necessary makefile in GNU make format.
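
As a rough illustration of how per-section word and noun-phrase statistics stored in XML might be read, here is a minimal Python sketch. The element and attribute names (corpus, section, word, np, id, title, text, count) and the sample values are assumptions made up for this example; they are not taken from the actual corpus file, whose real layout is described in the readme of the corpus archive.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: every tag, attribute, and value here is invented
# for illustration and does not reflect the real corpus schema.
SAMPLE = """\
<corpus>
  <section id="1.1" title="Example section A">
    <word text="sort" count="12"/>
    <word text="array" count="7"/>
    <np text="sorted array" count="5"/>
  </section>
  <section id="1.2" title="Example section B">
    <word text="time" count="9"/>
    <np text="running time" count="6"/>
  </section>
</corpus>
"""


def read_counts(section, tag):
    """Collect text -> count pairs from the child elements named `tag`."""
    return Counter(
        {item.get("text"): int(item.get("count")) for item in section.findall(tag)}
    )


def load_section_stats(xml_text):
    """Return a dict mapping each section id to its title and count tables."""
    root = ET.fromstring(xml_text)
    stats = {}
    for section in root.iter("section"):
        stats[section.get("id")] = {
            "title": section.get("title"),
            "words": read_counts(section, "word"),
            "noun_phrases": read_counts(section, "np"),
        }
    return stats


stats = load_section_stats(SAMPLE)
for section_id, entry in stats.items():
    # Print the most frequent word of each section.
    print(section_id, entry["title"], entry["words"].most_common(1))
```

To adapt the sketch to the real data, replace the assumed tag and attribute names with those documented in the readme, and read the file with ET.parse instead of parsing an inline string.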