Full Dataset

Full-sized images and segmentations

Scene Parsing Benchmark

Scene parsing data and part segmentation data derived from the ADE20K dataset can be downloaded from the MIT Scene Parsing Benchmark.

Training set
20,210 images (browse)

All images are fully annotated with objects, and many of the images have parts too.

Validation set
2,000 images (browse)

Fully annotated with objects and parts

Test set
Images to be released later.

Consistency set
64 images and annotations used for checking the annotation consistency (download)


Images and annotations:

Each folder contains images separated by scene category (the same scene categories as the Places Database). For each image, the object and part segmentations are stored in two different png files. All object and part instances are annotated separately.

For each image there are the following files:

  • *.jpg: RGB image.
  • *_seg.png: object segmentation mask. This image contains information about the object class segmentation masks and also separates each class into instances. The channels R and G encode the object class masks. The channel B encodes the object instance masks. The function loadAde20K.m extracts both masks.
  • *_seg_parts_N.png: parts segmentation mask, where N is a number (1,2,3,...) indicating the level in the part hierarchy. Parts are organized in a tree where objects are composed of parts, parts can be composed of parts, and parts of parts can have parts too. The level N indicates the depth in the part tree; level N=1 corresponds to parts of objects. All the part segmentations use the same encoding as the object segmentation masks: classes are coded in the RG channels and instances in the B channel. Use the function loadAde20K.m to extract the part segmentation masks and to separate instances of the same class.
  • *_.txt: text file describing the content of each image (objects and parts). This information is redundant with the other files, but it also contains information about object attributes. The function loadAde20K.m also parses the content of this file. Each line in the text file contains: column 1 = instance number, column 2 = part level (0 for objects), column 3 = occluded (1 for true), column 4 = class name (parsed using WordNet), column 5 = original raw name (might provide a more detailed categorization), column 6 = comma-separated list of attributes.
  • The following example has two part levels. The first segmentation shows the object masks. The second segmentation corresponds to object parts (body parts, mug parts, table parts, ...). The third segmentation shows parts of the heads (eyes, mouth, nose, ...):
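For readers working outside Matlab, the RGB encoding described above can be decoded with a short Python sketch. The `(R // 10) * 256 + G` class formula mirrors the decoding in loadAde20K.m; treat it as an assumption and verify against that function if results look off.

```python
import numpy as np

def decode_seg_png(rgb):
    """Decode an ADE20K *_seg.png (or *_seg_parts_N.png) given as a
    uint8 array of shape (H, W, 3), e.g. np.array(Image.open(path)).

    Returns (class_mask, instance_mask). Assumption: the class index is
    (R // 10) * 256 + G and the B channel labels instances, mirroring
    loadAde20K.m; 0 marks unlabeled pixels.
    """
    rgb = rgb.astype(np.uint16)
    class_mask = (rgb[..., 0] // 10) * 256 + rgb[..., 1]
    # Relabel the raw B values as consecutive instance indices.
    _, inverse = np.unique(rgb[..., 2].ravel(), return_inverse=True)
    instance_mask = inverse.reshape(rgb.shape[:2])
    return class_mask, instance_mask
```

The same function applies to the part masks, one level (one png file) at a time.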

    Matlab file: index_ade20k_2015.mat

  • filename: cell array of length N=22210 with the image file names.
  • folder: cell array of length N with the image folder names.
  • scene: cell array of length N providing the scene name (same classes as the Places database) for each image.
  • objectnames: cell array of length C with the object class names.
  • wordnet_found: array of length C. It indicates whether the object name was found in WordNet.
  • wordnet_hypernym: cell array of length C. WordNet hypernyms for each object name.
  • wordnet_gloss: cell array of length C. WordNet definition.
  • objectcounts: array of length C with the number of instances for each object class.
  • objectPresence: array of size [length C, N] with the object counts per image. objectPresence(c,i)=n if in image i there are n instances of object class c.
  • objectIsPart: array of size [length C, N] counting how many times an object is a part in each image. objectIsPart(c,i)=m if in image i object class c is a part of another object m times. For objects, objectIsPart(c,i)=0, and for parts we will find: objectIsPart(c,i) ≈ objectPresence(c,i).
  • proportionClassIsPart: array of length C with the proportion of times that class c behaves as a part. If proportionClassIsPart(c)=0 then it means that this is a main object (e.g., car, chair, ...). See below for a discussion on the utility of this variable.
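As a brief illustration of how these index arrays can be queried (a Python sketch; the array names mirror the Matlab variables, and loading index_ade20k_2015.mat, e.g. via scipy.io.loadmat, is not shown):

```python
import numpy as np

def images_containing(object_presence, c, min_instances=1):
    """Indices of images with at least `min_instances` instances of
    class c. `object_presence` is the (C, N) objectPresence array;
    0-based indices are assumed in this sketch."""
    return np.flatnonzero(object_presence[c] >= min_instances)

def behaves_as_object(proportion_class_is_part, c, threshold=0.5):
    """Heuristic: treat class c as a main object when it acts as a
    part less than `threshold` of the time. The 0.5 threshold is our
    illustrative choice, not part of the dataset."""
    return proportion_class_is_part[c] < threshold
```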
  • Matlab tools:

    To load a segmentation use loadAde20K.m, which will return the segmentation masks for objects and their parts:

    [Om, Oi, Pm, Pi, objects, parts] = loadAde20K(filename)

    Segmentation masks:

  • Om = [n * m]: object class mask. Each pixel in the mask, Om(i,j)=c, contains the object class index: objectnames{Om(i,j)} gives the object name at location (i,j). Om(i,j)=0 for unlabeled pixels.

  • Oi = [n * m]: object instance mask. Each distinct object (even if they belong to the same class) has a different index.

  • Pm = [n * m * Nlevels]: part class mask. Each pixel in the mask, Pm(i,j,n)=c, contains the part class index: objectnames{Pm(i,j,n)} gives the part name at location (i,j) and level=n.

  • Pi = [n * m * Nlevels]: part instance mask.

  • Object properties:

  • objects.instancendx: index inside 'ObjectInstanceMasks'
  • objects.class: object name
  • objects.iscrop: indicates if the object is whole (iscrop=0) or partially visible (iscrop=1)
  • objects.listattributes: comma separated list of attributes such as 'sitting', ...
  • Part properties:

  • parts.instancendx: index inside 'PartsInstanceMasks(:,:,level)'
  • parts.level: level in the part hierarchy. Level = 1 means it is a direct part of an object. Level = 2 means it is a part of a part.
  • parts.class: part name
  • parts.iscrop: whole (iscrop=0) or partially visible (iscrop=1)
  • parts.listattributes: comma separated list of attributes such as 'sitting', ...
  • Check demo.m to see examples of how to extract segments and parts, and how to index the object names.


    The annotated images cover the scene categories from the SUN and Places databases. Here are some examples showing the images, object segmentations, and part segmentations:

    You can browse the rest of the images here: ADE20K browser.

    The next visualization provides the list of objects and parts and the number of annotated instances. The tree only shows objects with more than 250 annotated instances and parts with more than 10 annotated instances.

    Some classes can be both objects and parts. For instance, a "door" can be an object (in an indoor picture) or a part (when it is the "door" of a "car"). Some objects are always parts (e.g., a "leg", a "hand", ...), although in some cases they can appear detached from the whole (e.g., a car "wheel" inside a garage), and some objects are never parts (e.g., a "person", a "truck", ...). The same class name (e.g., "door") can correspond to several visual categories depending on which object it is a part of. For instance, a car door is visually different from a cabinet door or a building door. However, they share similar affordances. The value proportionClassIsPart(c) can be used to decide if a class behaves mostly as an object or as a part. When an object is not part of another object, its segmentation mask will appear inside *_seg.png. If the class behaves as a part, then the segmentation mask will appear inside *_seg_parts.png. Correctly detecting an object requires classifying whether the object is behaving as an independent object or as a part of another object.


    Use the validation set to evaluate your algorithm. You can use the evaluation package for the scene parsing challenge.

    Dataset bias

    In the training set:

  • The median aspect ratio of the images is 4/3.
  • The median image size is 307200 pixels. The average image size is 1.3Mpixels.
  • The mode of the object segmentations is shown below and contains the four objects (from top to bottom): 'sky', 'wall', 'building' and 'floor'.
  • The mode of the part segmentations has two classes: 'window' and 'door'.
  • In the validation set:

  • Simply using the mode to segment the images gets, on average, 20.3% of the pixels of each image right on the validation set.
  • The Intersection over Union (IoU), on the validation set, for the four classes present in the segmentation mode is:
  • Class   Sky    Wall   Building   Floor
    IoU     0.20   0.19   0.07       0.18
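A minimal sketch of how such per-class IoU numbers can be computed (illustrative only; use the official evaluation package mentioned above for benchmark results):

```python
import numpy as np

def class_iou(pred, gt, class_id):
    """Intersection over Union for one class between predicted and
    ground-truth label masks (integer arrays of equal shape)."""
    p = pred == class_id
    g = gt == class_id
    union = np.logical_or(p, g).sum()
    if union == 0:
        return float("nan")  # class absent from both masks
    return np.logical_and(p, g).sum() / union
```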

    Annotation noise analysis

    To analyze the annotation consistency we took a subset of 64 randomly chosen images from the validation set and asked our annotator to annotate them again, six months later. 20 of those images were also annotated by two external annotators. Some differences between two annotations are expected, even when the task is done by the same person. On average, 82% of the pixels received the same label. The following figure shows a picture and two segmentations done by the same annotator:

    The set of 64 images annotated several times are available here:
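The pixel-level agreement figure above can be reproduced with a one-line comparison, assuming both annotations of an image are rendered to class masks of equal shape (e.g., with the decoding of loadAde20K.m):

```python
import numpy as np

def pixel_agreement(mask_a, mask_b):
    """Fraction of pixels assigned the same class label by two
    annotations of the same image (equal-shape integer arrays)."""
    return float((mask_a == mask_b).mean())
```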


    If you find this dataset useful, please cite the following publication:

    Scene Parsing through ADE20K Dataset. Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso and Antonio Torralba. Computer Vision and Pattern Recognition (CVPR), 2017. [PDF] [bib]

    Semantic Understanding of Scenes through ADE20K Dataset. Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso and Antonio Torralba. arXiv:1608.05442. [PDF] [bib]