The SLS group has produced a variety of software for conversational
interaction and spoken language processing. The software listed below
is publicly available to support research efforts in the speech and
- The MIT
Finite-State Transducer (FST) Toolkit is available for download as
open source software (BSD license). It is known to build and run in
various flavors of Linux with various versions of GNU GCC/G++,
including 64-bit Linux. It also compiles under Visual Studio 2005 for
- The MIT Language
Modeling (MITLM) Toolkit is a set of tools designed for the
efficient estimation of statistical n-gram language models involving
iterative parameter estimation. It achieves much of its efficiency
through the use of a compact vector representation of n-grams.
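
To make the idea of a vector representation of n-grams concrete, here is a minimal Python sketch. It is not MITLM code and does not claim to mirror its internals. It assumes one plausible layout: for each order, parallel lists hold the index of the history n-gram (in the next lower order) and the id of the final word. The names `build_ngram_vectors` and `ngram_words` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_ngram_vectors(tokens: List[str], max_order: int):
    """Count n-grams, storing each order as parallel (history index, word id) lists."""
    vocab: Dict[str, int] = {}
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]

    # Order 0 holds one empty n-gram, which is the history of every unigram.
    hists: List[List[int]] = [[0]] + [[] for _ in range(max_order)]
    words: List[List[int]] = [[-1]] + [[] for _ in range(max_order)]
    counts: List[List[int]] = [[len(ids)]] + [[] for _ in range(max_order)]
    # Lookup tables used only while building; the vectors are the compact result.
    index: List[Dict[Tuple[int, int], int]] = [{} for _ in range(max_order + 1)]

    for pos in range(len(ids)):
        hist = 0  # start from the empty history
        for order in range(1, max_order + 1):
            end = pos + order - 1
            if end >= len(ids):
                break
            key = (hist, ids[end])
            slot = index[order].get(key)
            if slot is None:
                slot = len(hists[order])
                index[order][key] = slot
                hists[order].append(hist)
                words[order].append(ids[end])
                counts[order].append(0)
            counts[order][slot] += 1
            hist = slot  # this n-gram is the history of the next higher order
    return vocab, hists, words, counts


def ngram_words(order: int, slot: int, hists, words, id_to_word) -> List[str]:
    """Recover the words of an n-gram by following history indices downward."""
    out = []
    while order > 0:
        out.append(id_to_word[words[order][slot]])
        slot = hists[order][slot]
        order -= 1
    return out[::-1]


if __name__ == "__main__":
    vocab, hists, words, counts = build_ngram_vectors("a b a b c".split(), 2)
    id_to_word = {i: w for w, i in vocab.items()}
    for slot, count in enumerate(counts[2]):
        print(ngram_words(2, slot, hists, words, id_to_word), count)
```

Because each n-gram stores only two integers regardless of its order, the per-order vectors stay small and can be processed with array operations, which is the kind of saving a compact representation is meant to provide.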
Over the years, the SLS group has been involved in a wide variety of
data collection efforts, such as TIMIT, ATIS, WSJ, and Communicator.
The corpora listed below are publicly available to support research
efforts in speech and language processing.
- The Crowdsourced Language Assessment Corpus (CLAC) consists of audio recordings and automatically generated transcripts from 1,832 speakers for several speech and language tasks, as well as metadata for each speaker.
- Spoken ObjectNet is a corpus of 50,273 English spoken audio captions for the images in the ObjectNet dataset.
- The Flickr8k Audio Caption
Corpus is a corpus of spoken audio captions for the images
included in the Flickr8k dataset.
- The Places Audio Caption
Corpus is a corpus of 400,000 English and 100,000 Hindi free-form,
spoken audio captions for images from the MIT Places 205 dataset.
- The SpokenCOCO Audio Caption
Corpus is a corpus of 600,000 English spoken audio captions for
the MSCOCO dataset.
- The MIT
Restaurant Corpus is a semantically tagged training and test
corpus in BIO format.
- The MIT Movie Corpus is a semantically tagged training and test
corpus in BIO format. The eng corpus consists of simple queries, and
the trivia10k13 corpus consists of more complex queries.
- The Arabic Fact-Checking and Stance Detection Corpus is a collection of claims and corresponding articles returned by Google search. It contains factuality annotations of claims, as well as stance annotation of claim-article pairs. This is the first corpus integrating both factuality and stance.
- The Arabic Dialect Identification for 17 countries (ADI17) dataset comprises over 3,000 hours of Arabic dialect speech from 17 Arab countries, collected from YouTube for fine-grained Arabic dialect identification and analysis.
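
The MIT Restaurant and MIT Movie corpora above are described as being in BIO format, where each token carries a tag such as B-X (beginning of a span of type X), I-X (inside that span), or O (outside any span). The Python sketch below shows one way such files might be read and decoded into labeled spans. It assumes one token per line, a blank line between sentences, and the tag in the first column; that column order is an assumption and can be switched with the `tag_first` parameter. The function names and the sample lines are invented for illustration and are not taken from the corpora.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, tags)


def read_bio(lines: Iterable[str], tag_first: bool = True) -> List[Sentence]:
    """Group two-column BIO lines into sentences separated by blank lines."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        first, second = line.split(None, 1)
        tag, token = (first, second) if tag_first else (second, first)
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences


def bio_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Decode BIO tags into (label, text) spans.

    An I- tag that does not continue the current span is treated leniently
    as the start of a new span.
    """
    spans: List[Tuple[str, str]] = []
    label, start = None, 0
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current span
        if label is not None:
            spans.append((label, " ".join(tokens[start:i])))
            label = None
        if tag.startswith(("B-", "I-")):
            label, start = tag[2:], i
    return spans


if __name__ == "__main__":
    # Invented sample lines, not copied from any corpus file.
    sample = ["O\tany", "B-Rating\tfour", "I-Rating\tstar",
              "B-Cuisine\tthai", "O\tplaces"]
    for toks, tgs in read_bio(sample):
        print(bio_spans(toks, tgs))
```

Keeping the reader and the span decoder separate means the decoder can also be applied to tag sequences predicted by a model, not only to tags read from a file.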