Spoken ObjectNet

Overview

Spoken ObjectNet (SON) is a corpus of 50,273 English spoken audio captions for the images in the ObjectNet dataset. For a description of the corpus, see:

Palmer, I., Rouditchenko, A., Barbu, A., Katz, B., and Glass, J. (2021). Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset. Proc. Interspeech 2021, 3650-3654. doi: 10.21437/Interspeech.2021-245

Please note that the arXiv version of the paper contains additional experiments on the Spoken ObjectNet test set.

This corpus contains only the audio recordings, not the associated images. You will need to download the ObjectNet image dataset separately, as described on the downloads page.

If you use this data in your own publications, please cite the paper above as well as the original publication listed on the ObjectNet website.

This data is distributed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.