The following speech corpora were collected to investigate how spoken language (words, sub-word units, higher-level semantics, etc.) can be learned from visually-grounded speech.
If you use this data in your own publications, please cite the relevant papers for each dataset.
All of the following datasets are distributed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license (link).
Places Audio Captions (English) 400k
The Places Audio Caption (English) 400k Corpus contains approximately 400,000 English spoken captions for natural images drawn from the Places 205 image dataset. For a description of the corpus, see:
D. Harwath, A. Torralba, and J. Glass, "Unsupervised Learning of Spoken Language with Visual Context," Proc. of Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016 (PDF)
D. Harwath and J. Glass, "Learning Word-Like Units from Joint Audio-Visual Analysis," Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, July 2017 (PDF)
D. Harwath, A. Recasens, D. Suris, G. Chuang, A. Torralba, and J. Glass, "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input," Proc. of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, September 2018 (PDF)
This corpus includes only the audio recordings, not the associated images. You will need to download the Places image dataset separately here.
If you use this data in your own publications, please cite the papers above as well as the relevant publications listed on the Places website.
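Once the audio captions and the Places images have both been downloaded, each spoken caption must be matched back to the image it describes. Below is a minimal sketch of that join, assuming a hypothetical JSON metadata file that maps each utterance ID to its wav file and image path; the field names (`uttid`, `wav`, `image`) and directory layout are illustrative assumptions, and the actual distribution's metadata format may differ.

```python
import json

# Hypothetical metadata: one record per spoken caption, mapping the
# utterance ID to its audio file and the Places image it describes.
# The real corpus metadata may use different field names.
metadata_json = """
[
  {"uttid": "utt_0001", "wav": "wavs/utt_0001.wav", "image": "a/abbey/gsun_0001.jpg"},
  {"uttid": "utt_0002", "wav": "wavs/utt_0002.wav", "image": "b/bridge/gsun_0042.jpg"}
]
"""

def pair_captions_with_images(records, audio_root, image_root):
    """Join each spoken caption with the full path of its source image."""
    return [
        (f"{audio_root}/{r['wav']}", f"{image_root}/{r['image']}")
        for r in records
    ]

records = json.loads(metadata_json)
pairs = pair_captions_with_images(records, "PlacesAudio400k", "Places205")
for wav_path, img_path in pairs:
    print(wav_path, "->", img_path)
```

The same pattern applies to the Hindi captions and to SpokenCOCO, substituting the corresponding audio and image roots.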
Places Audio Captions (Hindi) 100k
The Places Audio Caption (Hindi) 100k Corpus contains approximately 100,000 Hindi spoken captions for natural images drawn from the Places 205 image dataset. For a description of the corpus, see:
D. Harwath, G. Chuang, and J. Glass, "Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech," Proc. ICASSP, Calgary, Canada, April 2018 (PDF)
You will need to download the Places image dataset separately at the link above.
SpokenCOCO (English) 600k
SpokenCOCO (English) 600k contains approximately 600,000 recordings of human speakers reading the MSCOCO image captions out loud (in English). Each MSCOCO caption is read once. For a description of the dataset, please see:
W-N. Hsu, D. Harwath, C. Song, and J. Glass, "Text-Free Image-to-Speech Synthesis Using Learned Segmental Units," at the NeurIPS Workshop on Self-Supervised Learning for Speech and Audio Processing, December 2020.
You will need to download the MSCOCO image dataset separately here.
Flickr8k Audio Captions (English)
The Flickr 8k Audio Caption Corpus contains 40,000 audio recordings of humans reading the original Flickr 8k captions out loud (in English). For a description of the corpus, see:
D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015 (PDF)
This corpus includes only the audio recordings, not the original text captions or associated images. The original Flickr 8k corpus, which contains the text captions and links to the source images, is available for download here.
Spoken ObjectNet is hosted on a separate web page.