The Places Audio Caption 400K Corpus contains approximately 400,000 spoken captions for natural images drawn from the Places 205 image dataset. It was collected to investigate multimodal learning schemes for unsupervised co-discovery of speech patterns and visual objects. For a description of the corpus, see:
D. Harwath, A. Torralba, and J. Glass, "Unsupervised Learning of Spoken Language with Visual Context," Proc. of Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016.
D. Harwath and J. Glass, "Learning Word-Like Units from Joint Audio-Visual Analysis," Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, July 2017.
D. Harwath, A. Recasens, D. Suris, G. Chuang, A. Torralba, and J. Glass, "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input," Proc. of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, September 2018.
This corpus includes only the audio recordings, not the associated images. You will need to download the Places image dataset separately.
If you use this data in your own publications, please cite the papers above as well as the relevant publications listed on the Places website.
This data is distributed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.