Antonio Torralba

Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science.

Head of the AI+D faculty, EECS department.

Computer Science and Artificial Intelligence Laboratory - Dept. of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Office: 32-386G
32 Vassar Street, Cambridge, MA 02139
Email: torralba@mit.edu
Assistant: Fern DeOliveira Keniston

My research is in the areas of computer vision, machine learning, and human visual perception. I am interested in building systems that can perceive the world like humans do. Although my work focuses on computer vision, I am also interested in other modalities such as audition and touch. A system able to perceive the world through multiple senses might be able to learn without requiring massive curated datasets. Other interests include understanding neural networks, common-sense reasoning, computational photography, building image databases, ..., and the intersections between visual art and computation.

Lab Members

Jonas Wulff
Postdoc

Tianmin Shu
Postdoc

Chuang Gan
IBM Researcher

Bernat Felip i Diaz
Visiting Student

Mireia Hernandez Caralt
Visiting Student

Ching-Yao Chuang
Grad Student

David Bau
Grad Student

Joanna Materzynska
Grad Student

Manel Baradad
Grad Student

Nadiia Chepurko
Grad Student

Pratyusha Sharma
Grad Student

Sarah Schwettmann
Grad Student

Shuang Li
Grad Student

Tongzhou Wang
Grad Student

Wei-Chiu Ma
Grad Student

Xavier Puig Fernandez
Grad Student

Yunzhu Li
Grad Student

Ethan Weber
MEng Student

Jingwei Ma
MEng Student

Mahi Elango
MEng Student

Christine Yejin You
MEng Student

Ioannis Kaklamanis
UROP Student

Sam Boshar
UROP Student

Past Students and Postdocs

Adrià Recasens (Graduated 2019), Hang Zhao (Graduated 2019), Jun-Yan Zhu (Postdoc), Bolei Zhou (Graduated 2018), Carl Vondrick (Graduated 2017), Javier Marin (Postdoc), Yusuf Aytar (Postdoc), Andrew Owens (Graduated 2016), Aditya Khosla (Graduated 2016), Agata Lapedriza (Visiting professor, UOC), Joseph J. Lim (Graduated 2015), Lluis Castrejon (Visiting student, 2015), Hamed Pirsiavash (Postdoc), Zoya Gavrilov (Grad student), Josep Marc Mingot Hidalgo (Visiting student), Tomasz Malisiewicz (Postdoc), Jianxiong Xiao (Graduated 2013), Dolores Blanco Almazan (Visiting student, 2012), Biliana Kaneva (Graduated 2011), Jenny Yuen (Graduated 2011), Tilke Judd (Graduated 2011), Myung "Jin" Choi (Graduated 2011), James Hays (Postdoc), Hector J. Bernal (Visiting student), Gunhee Kim (Visiting student), Bryan C. Russell (Graduated 2008).

Research

It is all about context!

Scene understanding and context-driven object recognition.

Integration of vision, audition and touch (and smell!): perceiving the world via multiple senses. I would like to study computer vision in the context of other perceptual modalities.

Building datasets: AI is an empirical science. Measuring the world is an important part of asking questions about perception and building perceptual models. I am interested in building datasets with complex scenes, with objects in context and multiple perceptual modalities.

Dissecting neural networks: visualization and interpretation of the representations learned by neural networks. See GAN Dissection and Network Dissection.

News

2020 - Named head of the faculty of artificial intelligence and decision-making (AI+D). AI+D is a new unit within EECS that brings together machine learning, AI, and decision making, while keeping strong connections with its roots in EE and CS. The unit focuses on faculty recruiting, mentoring, promotion, academic programs, and community building.

2018 - 2020 MIT Quest for Intelligence: named inaugural director of the MIT Quest for Intelligence. The Quest is a campus-wide initiative to discover the foundations of intelligence and to drive the development of technological tools that can positively influence virtually every aspect of society.

2017 - 2020 MIT-IBM Watson AI Lab: named MIT director of the MIT-IBM Watson AI Lab.

Cool news

The Late Show with Stephen Colbert on the work by Carl and Hamed, Anticipating Visual Representations from Unlabeled Video. CVPR 2016.

The Marilyn Monroe/Albert Einstein hybrid image by Aude Oliva on BBC.

German TV science show on accidental cameras. Details about accidental cameras and some of our videos are available here.


Datasets

Virtual Home (2019). VirtualHome is a platform to simulate complex household activities via programs. A key aspect of VirtualHome is that it allows complex interactions with the environment, such as picking up objects, switching appliances on and off, opening appliances, etc. The simulator can easily be driven through a Python API: write the activity as a simple sequence of instructions, which then gets rendered in VirtualHome (see the sketch below). You can choose between different agents and environments, as well as modify environments on the fly. You can also stream different kinds of ground truth, such as time-stamped actions, instance/semantic segmentation, optical flow, and depth. More details about the environment and platform are at www.virtual-home.org.
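
To make the idea concrete, here is a minimal sketch of such a program. The "[Action] <object> (instance id)" instruction format follows the VirtualHome paper, but the module, class, and argument names below are hypothetical placeholders, not the actual API.

    # Hypothetical sketch of driving VirtualHome from Python. The import,
    # class, and argument names are placeholders, not the real API; only the
    # "[Action] <object> (instance id)" instruction format follows the paper.
    from virtualhome import Simulator  # hypothetical import

    # Describe the activity "watch TV" as a simple sequence of instructions.
    watch_tv = [
        '[Walk] <television> (1)',
        '[SwitchOn] <television> (1)',
        '[Walk] <sofa> (1)',
        '[Sit] <sofa> (1)',
        '[Watch] <television> (1)',
    ]

    # Pick an agent and an environment, render the program, and stream some
    # of the available ground truth alongside the rendered frames.
    sim = Simulator(environment=0, agent='female1')
    sim.render_script(watch_tv,
                      ground_truth=['actions', 'segmentation', 'flow', 'depth'])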

Gaze360 (2019). Understanding where people are looking is an informative social cue that machines need to understand in order to interact with humans. In this work, we present Gaze360, a large-scale gaze-tracking dataset and a method for robust 3D gaze estimation in unconstrained images. The dataset consists of 238 participants in indoor and outdoor environments with labeled 3D gaze across a wide range of head poses and distances.

The Places Audio Caption Corpus (2018). The Places Audio Caption 400K Corpus contains approximately 400,000 spoken captions for natural images drawn from the Places 205 image dataset. It was collected to investigate multimodal learning schemes for unsupervised co-discovery of speech patterns and visual objects.

ADE20K dataset (2017). 22,210 fully annotated images with over 430,000 object instances and 175,000 parts. All images are fully segmented, with over 3,000 object and part categories. A reduced version of the dataset is used for the scene parsing challenge.

Places database (2017). The database contains more than 10 million images comprising 400+ scene categories, with 5,000 to 30,000 training images per class. More details appear in: "Learning Deep Features for Scene Recognition using Places Database," B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. NIPS 2014 (pdf). The Places database has two releases: Places release 1 contains 205 scene categories and 2.5 million images; Places release 2 contains 400 scene categories and 10 million images. Pre-trained models are available here (see the loading sketch below).
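
For reference, a minimal sketch of how a Places-pretrained CNN is commonly loaded in PyTorch. The checkpoint filename and its 'state_dict' layout are assumptions about the downloaded file rather than a documented interface; the 365 output classes correspond to the widely used Places365 release.

    import torch
    import torchvision.models as models

    # Build a ResNet-18 with 365 outputs (one per Places365 scene category).
    model = models.resnet18(num_classes=365)

    # 'resnet18_places365.pth.tar' is a placeholder filename; we also assume
    # the checkpoint stores its weights under a 'state_dict' key.
    checkpoint = torch.load('resnet18_places365.pth.tar', map_location='cpu')

    # Checkpoints saved from DataParallel training prefix keys with 'module.';
    # strip that prefix, if present, before loading.
    state_dict = {k.replace('module.', '', 1): v
                  for k, v in checkpoint['state_dict'].items()}
    model.load_state_dict(state_dict)
    model.eval()  # switch to inference mode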

CMPlaces (2016). CMPlaces is designed to train and evaluate cross-modal scene recognition models. It covers five different modalities: natural images, sketches, clip art, text descriptions, and spatial text images. The dataset has thousands of line drawings and textual descriptions of scenes, and it is organized with the same categories as the Places database. More details in the paper (pdf).

3D IKEA dataset (2013). In order to develop and evaluate fine pose estimation based on 3D models, we created a new dataset of images and 3D models representing typical indoor scenes. We collected IKEA 3D models from Google 3D Warehouse and images from Flickr. The dataset contains 759 images and 219 3D models; all 759 images are annotated using the available models (about 90 different models). We also separate the data into two splits: IKEAobject and IKEAroom.

360-SUN Database (2012). A database of 360-degree panoramas organized along the SUN categories. The pose of an object carries crucial semantic meaning for object manipulation and usage (e.g., grabbing a mug, watching a television). Just as pose estimation is part of object recognition, viewpoint recognition is an important component of scene recognition. For instance, a theater has a clear, distinct distribution of objects – a stage on one side and seats on the other – that defines unique views in different orientations. The goal of this dataset was to study the viewpoint recognition problem in scenes.

Out-of-context objects (2012). The database contains 218 fully annotated images, each with at least one out-of-context object. Context models have been evaluated mostly by the improvement they bring to object recognition performance, even though that is only one of many ways to exploit contextual information. Can you detect the out-of-context object? Detecting "out-of-context" objects and scenes is challenging because context violations can be detected only if the relationships between objects are carefully and precisely modeled. Project page

Indoor Scene Recognition Database (2009). The database contains 67 indoor categories and a total of 15,620 images.

80 Million Tiny Images: explore a dense sampling of the visual world (2008). The web page shows a visual index of all the nouns in WordNet. A portion of this dataset was manually curated and used to create the CIFAR datasets, which were in turn used to develop some of the early neural nets around 2010 (see the loading sketch below). Although the images are too small to train working systems (each image is only 32x32 pixels), the dataset is important for fundamental research.
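
As a usage note, the derived CIFAR-10 dataset is still straightforward to load today, for example through torchvision; the sketch below assumes torchvision is installed and downloads the data to a local ./data directory.

    # Load CIFAR-10, the curated subset derived from 80 Million Tiny Images.
    import torchvision
    import torchvision.transforms as transforms

    cifar = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True,
        transform=transforms.ToTensor())

    image, label = cifar[0]                    # a 3x32x32 image tensor
    print(image.shape, cifar.classes[label])   # tensor shape and class name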

LabelMe (2005). The goal of LabelMe is to provide an online annotation tool to build image databases for computer vision research. LabelMe started so long ago... it is hard to believe it is still up and running. The code is available here: github.

8 scene categories database (2001). This dataset contains 8 outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings, and highways. There are 2,600 color images, 256x256 pixels.


Publications

By year: 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001.