Antonio Torralba

Delta Electronics Professor of Electrical Engineering and Computer Science.

Head of the AI+D faculty, EECS dept.

Computer Science and Artificial Intelligence Laboratory - Dept. of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Office: 32-386G
32 Vassar Street, Cambridge, MA 02139
Assistant: Fern DeOliveira Keniston

My research is in the areas of computer vision, machine learning and human visual perception. I am interested in building systems that can perceive the world as humans do. Although my work focuses on computer vision, I am also interested in other modalities such as audition and touch. A system able to perceive the world through multiple senses might be able to learn without requiring massive curated datasets. Other interests include understanding neural networks, common-sense reasoning, computational photography, building image databases, ..., and the intersections between visual art and computation.

Lab Members

Adrián Rodríguez
Grad Student

George Cazenavette
Grad Student

Joanna Materzynska
Grad Student

Kabir Swain
Grad Student

Krishna Murthy

Manel Baradad
Grad Student

Pratyusha Sharma
Grad Student

Sarah Schwettmann
Research Scientist

Shivam Duggal
Grad Student

Tongzhou Wang
Grad Student

Yichen Li
Grad Student

Past Students and Postdocs

Wei-Chiu Ma (Graduated 2023), Shuang Li (Graduated 2023), Ching-Yao Chuang (Graduated 2023), Tianmin Shu (Postdoc 2023), Hengshuang Zhao (Postdoc 2022), Xavier Puig Fernandez (Graduated 2022), Yunzhu Li (Graduated 2022), Nadiia Chepurko (Grad. Student), Ali Jahanian (Research scientist), David Bau (Graduated 2021), Dim P. Papadopoulos (Postdoc), Jonas Wulff (Postdoc), Adrià Recasens (Graduated 2019), Hang Zhao (Graduated 2019), Jun-Yan Zhu (Postdoc), Bolei Zhou (Graduated 2018), Carl Vondrick (Graduated 2017), Javier Marin (Postdoc), Yusuf Aytar (Postdoc), Andrew Owens (Graduated 2016), Aditya Khosla (Graduated 2016), Agata Lapedriza (Visiting professor, UOC), Joseph J. Lim (Graduated 2015), Lluis Castrejon (Visiting student, 2015), Hamed Pirsiavash (Postdoc), Zoya Gavrilov (Grad. Student), Tomasz Malisiewicz (Postdoc), Jianxiong Xiao (Graduated 2013), Biliana Kaneva (Graduated 2011), Jenny Yuen (Graduated 2011), Tilke Judd (Graduated 2011), Myung "Jin" Choi (Graduated 2011), James Hays (Postdoc), Bryan C. Russell (Graduated 2008).


Foundations of Computer Vision
with Phillip Isola and Bill Freeman
MIT Press

Our book is finished!

Lots of things have happened since we started thinking about this book in November 2010; yes, it has taken us more than 10 years to write this book. Our initial goal was to write a large book that provided good coverage of the field. Unfortunately, the field of computer vision is just too large for that. So, we decided to write a small book instead, limiting each chapter to no more than five pages. Writing a short book was perfect because we did not have time to write a long book and you did not have time to read it. Unfortunately, we have failed at that goal, too. This book covers foundational topics within computer vision, from an image processing and machine learning perspective. The audience is undergraduate and graduate students who are entering the field, but we hope experienced practitioners will find the book valuable as well.


It is all about context!

Scene understanding and context-driven object recognition.

Integration of vision, audition and touch (and smell!): perceiving the world via multiple senses. I would like to study computer vision in the context of other perceptual modalities.

Building datasets: AI is an empirical science. Measuring the world is an important part of asking questions about perception and building perceptual models. I am interested in building datasets with complex scenes, with objects in context and multiple perceptual modalities.

Dissecting neural networks: visualization and interpretation of the representation learned by neural networks. GAN dissection and Network dissection.


2020 - Named the head of the faculty of artificial intelligence and decision-making (AI+D). AI+D is a new unit within EECS, which brings together machine learning, AI and decision making, while keeping strong connections with its roots in EE and CS. This unit focuses on faculty recruiting, mentoring, promotion, academic programs, and community building.

2018 - 2020 MIT Quest for Intelligence: I was named the inaugural director of the MIT Quest for Intelligence. The Quest is a campus-wide initiative to discover the foundations of intelligence and to drive the development of technological tools that can positively influence virtually every aspect of society.

2017 - 2020 MIT-IBM Watson AI Lab: named the MIT director of the MIT-IBM Watson AI Lab.

Cool news

In March 2022, I was awarded a Doctor Honoris Causa degree by UPC. I graduated from UPC in 1994.

The Late Show with Stephen Colbert on the work by Carl and Hamed, Anticipating Visual Representations from Unlabeled Video. CVPR 2016.

The Marilyn Monroe/Albert Einstein hybrid image by Aude Oliva on BBC.

German TV science show on accidental cameras. Details about accidental cameras and some of our videos are available here.


Virtual Home (2019). VirtualHome is a platform to simulate complex household activities via programs. A key aspect of VirtualHome is that it allows complex interactions with the environment, such as picking up objects, switching appliances on and off, opening appliances, etc. Our simulator can easily be called with a Python API: write the activity as a simple sequence of instructions, which then gets rendered in VirtualHome. You can choose between different agents and environments, as well as modify environments on the fly. You can also stream different kinds of ground truth, such as time-stamped actions, instance/semantic segmentation, optical flow, and depth. Check out more details of the environment and platform in
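To make "an activity as a simple sequence of instructions" concrete, here is a minimal Python sketch. It does not call the VirtualHome API; the instruction syntax (`[Action] <object> (id)`), the `Step` class, and `parse_program` are all hypothetical names assumed for illustration. It only shows how such a program could be represented and checked before being handed to a simulator.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical instruction syntax, assumed for illustration only:
#   "[Action] <object> (instance_id)"   (the object part is optional)
_STEP = re.compile(
    r"^\[(?P<action>\w+)\](?:\s+<(?P<obj>[\w ]+)>\s+\((?P<id>\d+)\))?$"
)


@dataclass(frozen=True)
class Step:
    """One instruction of an activity program (hypothetical structure)."""
    action: str
    obj: Optional[str] = None
    instance: Optional[int] = None


def parse_program(lines: List[str]) -> List[Step]:
    """Turn instruction strings into Step records, rejecting malformed lines."""
    steps = []
    for n, line in enumerate(lines, start=1):
        m = _STEP.match(line.strip())
        if m is None:
            raise ValueError(f"line {n}: cannot parse {line!r}")
        inst = m.group("id")
        steps.append(
            Step(m.group("action"), m.group("obj"),
                 int(inst) if inst is not None else None)
        )
    return steps


# A household activity written as a plain sequence of instructions.
program = [
    "[Walk] <kitchen> (1)",
    "[Open] <fridge> (1)",
    "[Grab] <milk> (1)",
    "[Close] <fridge> (1)",
    "[StandUp]",
]

steps = parse_program(program)
for step in steps:
    print(step)
```

In a real setup, the parsed steps would be passed to the simulator for rendering; that call is deliberately left out here because its exact form depends on the VirtualHome release in use.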

Gaze 360 (2019). Understanding where people are looking is an informative social cue that machines need to understand to interact with humans. In this work, we present Gaze360, a large-scale gaze-tracking dataset and method for robust 3D gaze estimation in unconstrained images. Our dataset consists of 238 participants in indoor and outdoor environments with labelled 3D gaze across a wide range of head poses and distances.

The Places Audio Caption Corpus (2018). The Places Audio Caption 400K Corpus contains approximately 400,000 spoken captions for natural images drawn from the Places 205 image dataset. It was collected to investigate multimodal learning schemes for unsupervised co-discovery of speech patterns and visual objects.

ADE20K dataset (2017). 22,210 fully annotated images with over 430,000 object instances and 175,000 parts. All images are fully segmented with over 3000 object and part categories. A reduced version of the dataset is used for the scene parsing challenge.

Places database (2017). The database contains more than 10 million images comprising 400+ scene categories. The dataset features 5000 to 30,000 training images per class. More details appear in: "Learning Deep Features for Scene Recognition using Places Database," B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. NIPS 2014 (pdf). The Places database has two releases: Places release 1 contains 205 scene categories and 2.5 million images. Places release 2 contains 400 scene categories and 10 million images. Pre-trained models available here.

CMPlaces (2016). CMPlaces is designed to train and evaluate cross-modal scene recognition models. It covers five different modalities: natural images, sketches, clip-art, text descriptions, and spatial text images. The dataset is organized with the same categories as the Places database. More details in paper.pdf

Out of context objects (2012). The database contains 218 fully annotated images, each with at least one out-of-context object. Context models have been evaluated mostly based on the improvement of object recognition performance, even though that is only one of many ways to exploit contextual information. Can you detect the out-of-context object? Detecting “out-of-context” objects and scenes is challenging because context violations can be detected only if the relationships between objects are carefully and precisely modeled. Project page

LabelMe (2005). The goal of LabelMe is to provide an online annotation tool to build image databases for computer vision research. LabelMe started so long ago ... it is hard to believe it is still up and running.

8 scene categories database (2001). This dataset contains 8 outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. There are 2600 color images, 256x256 pixels.


Google scholar