Summary of Results

2014-2015


Washington University
  • We released a website, http://geocalibration.org, that lets the public geo-locate image data. It is a human-in-the-loop tool for matching image features to geo-locations (for example, locations found on Google Maps). It was used to search for the exact grave site of a Jane Doe crime victim buried 30 years ago, based on pictures from her funeral, allowing police to exhume the body and obtain their first new leads in the case in 30 years, as described in "Finding Jane Doe: a forensic application of 2D image calibration" and "Images Don't Forget: Online Photogrammetry to Find Lost Graves".
MIT
  • We generalized the well-known aperture problem to the case when objects are refractive. We found a relationship between the degree of distortion and the ability to recover parallel and perpendicular motion, and demonstrated this relationship with experiments.
  • We built a system for generating continuous shape-time images from RGB-D videos.
  • We found that it is possible to accurately estimate and visualize subpixel deviations from idealized geometric shapes for real-world objects, despite aliasing and other sources of error.
  • We found that it is possible to magnify subtle motions in unstabilized video by amplifying deviations from idealized structures. In some cases, these magnifications are substantially better than phase-based magnification methods.
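
    For concreteness, the following is a minimal sketch (not our actual pipeline) of the deviation-amplification idea on a single fitted edge: it fits an idealized straight line to detected subpixel edge points, reads off each point's signed deviation, and exaggerates those deviations for visualization. The function name and the amplification factor alpha are illustrative only.

      import numpy as np

      def amplify_edge_deviations(points, alpha=20.0):
          """Fit an idealized straight edge to subpixel edge points, measure each
          point's signed perpendicular deviation from that line, and exaggerate
          the deviations by a factor alpha for visualization.
          points: (N, 2) array of (x, y) edge locations."""
          pts = np.asarray(points, dtype=float)
          x, y = pts[:, 0], pts[:, 1]
          A = np.stack([x, np.ones_like(x)], axis=1)
          (m, c), _, _, _ = np.linalg.lstsq(A, y, rcond=None)  # ideal line y = m*x + c
          scale = np.sqrt(1.0 + m * m)
          dev = (y - (m * x + c)) / scale                      # signed deviation from the line
          normal = np.array([-m, 1.0]) / scale                 # unit normal of the line
          exaggerated = pts + (alpha - 1.0) * dev[:, None] * normal[None, :]
          return dev, exaggerated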
Cornell
  • We have explored three major themes:

    1. Reconstructing time-varying scenes in 4D from Internet photos. Our work on "Scene Chronology" showed how to take hundreds of thousands or millions of photos of a time-varying place like Times Square and automatically create a 4D model: a 3D model with a time slider that captures the changes in scene geometry and appearance across time. This work was based on a new scalable and robust algorithm that takes photos captured at different times and places, with noisy or missing timestamps, and infers both the 3D structure and the time span during which each object in the scene existed. The resulting 4D model can also be used to approximately timestamp new photos, using constraints based on which objects are visible in the image. This work appeared at ECCV 2014 and received the best paper award.

    2. Estimating scene reflectance and illumination from large-scale image collections. We have developed new methods for automatically decomposing scenes into intrinsic reflectance and per-image illumination maps from large, uncalibrated Internet photo collections. Our main innovation is to use the statistics of outdoor illumination, as predicted from computer graphics models of sun and sky illumination, and to connect these statistics with observations of the scene made across thousands of photos taken at unknown times. A further key element is to model the effect on illumination of the environment being occluded at each point (known in computer graphics as "ambient occlusion"); a simplified version of this shading model is sketched after this list. Ours is one of the first algorithms in computer vision to explicitly model ambient occlusion.

    3. Algorithms and datasets for automatically "grounding" photos in the world, by geo-tagging and time-stamping them. We have created new ways to automatically tell where and when a photo was taken. Our geo-tagging method is based on matching photos to a large, world-wide 3D model of places around the globe created using structure-from-motion techniques (a "world-wide point cloud" containing hundreds of millions of points with appearance descriptors). To make this approach scale, we have developed new methods for building structure-from-motion models of challenging scenes around the world (many of which have repeated structures that confuse standard reconstruction algorithms), as well as new methods for compressing 3D models while retaining as much useful information as possible. Using the resulting compressed world-wide point cloud, we can quickly match a new image and estimate its pose, precisely geolocating it. A minimal matching-and-pose sketch also appears after the list.

    We have also developed ways to use our models of scene illumination and appearance over time to automatically timestamp photos. Using transient scene elements we can approximately date a photo, and using illumination (i.e., sun direction) we can compute the time it was taken.
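
    The simplified per-point shading model referenced in theme 2 can be sketched as below. This is a toy of the kind of outdoor image formation the decomposition reasons about; the function and the constants L_sun and L_sky are placeholders for the sun/sky illumination statistics, not the actual model fit in the paper.

      import numpy as np

      def shaded_intensity(albedo, normal, ambient_occlusion, sun_dir, sun_visible,
                           L_sun=1.0, L_sky=0.3):
          """Per-point outdoor shading: a Lambertian sun term gated by a
          sun-visibility flag, plus a sky term scaled by ambient occlusion
          (the fraction of the sky hemisphere visible at the point).
          L_sun and L_sky stand in for the sun/sky illumination statistics."""
          sun_term = L_sun * float(sun_visible) * max(0.0, float(np.dot(normal, sun_dir)))
          sky_term = L_sky * ambient_occlusion
          return albedo * (sun_term + sky_term)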
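
    The localization step in theme 3 boils down to 2D-3D descriptor matching followed by pose estimation. The sketch below shows that step with off-the-shelf OpenCV calls (SIFT, a ratio-test matcher, and RANSAC PnP); the real system uses its own large-scale matching data structures, a compressed world-wide point cloud, and also estimates focal length and radial distortion, so treat this only as an outline of the idea with the intrinsics K assumed known.

      import numpy as np
      import cv2

      def localize_image(img_gray, db_descriptors, db_points3d, K):
          """Match one photo's SIFT features to a geo-referenced 3D point database
          and recover the camera pose with RANSAC PnP.
          db_descriptors: (M, 128) float32 descriptors attached to the 3D points,
          db_points3d:    (M, 3)  float32 point positions,
          K:              3x3 intrinsics (assumed known here for brevity)."""
          sift = cv2.SIFT_create()
          kps, desc = sift.detectAndCompute(img_gray, None)
          matcher = cv2.BFMatcher(cv2.NORM_L2)
          pts2d, pts3d = [], []
          for pair in matcher.knnMatch(desc, db_descriptors, k=2):
              if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:  # ratio test
                  pts2d.append(kps[pair[0].queryIdx].pt)
                  pts3d.append(db_points3d[pair[0].trainIdx])
          ok, rvec, tvec, inliers = cv2.solvePnPRansac(
              np.float32(pts3d), np.float32(pts2d), np.float64(K), None)
          R, _ = cv2.Rodrigues(rvec)
          camera_center = (-R.T @ tvec).ravel()   # camera position in the geo-referenced frame
          return ok, camera_center, R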

Harvard University
  • We presented a technique to recover geometry from time-lapse sequences of outdoor scenes. We built upon photometric stereo techniques to recover approximate shadowing, shading, and normal components, allowing us to alter the material and normals of the scene. We developed methods to estimate the reflection component due to skylight illumination. We also showed that sunlight directions are usually non-planar, thus making surface normal recovery possible. This allowed us to estimate approximate surface normals for outdoor scenes using a single day of data. We demonstrated the use of these surface normals for a number of image editing applications, including reflectance, lighting, and normal editing.
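
    The core normal-recovery step is a Lambertian least-squares solve. The following is a minimal sketch under the assumption that intensities have already been linearized and that shadowed samples and the skylight term have been handled separately (as the full method does); the function name is illustrative.

      import numpy as np

      def timelapse_photometric_stereo(I, L):
          """Lambertian least-squares normal recovery.
          I: (F, P) linearized intensities for P pixels over F frames (shadowed
             and sky-dominated samples assumed already removed),
          L: (F, 3) unit sun directions for those frames. The solve is well
             conditioned only if the sun directions are not (nearly) coplanar."""
          G, _, _, _ = np.linalg.lstsq(L, I, rcond=None)   # (3, P): albedo-scaled normals
          G = G.T
          albedo = np.linalg.norm(G, axis=1, keepdims=True)
          normals = G / np.maximum(albedo, 1e-8)
          return normals, albedo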

2013-2014


Washington University
  • We solved for the geometric relationship that relates the path of a tracked shadow to the shape of the surfaces onto which the shadow is cast, the time of day and day of the year at which the image was taken, and the 3D position of the shadow-casting object; a small sketch of the basic shadow geometry appears after this list.
  • We showed that image features can be matched over long time periods, but are less likely to match across different times of day and different times of year, most likely due to changes in shadows. We found that this is largely because feature detectors (e.g., the DoG detector in SIFT) are not consistent, so features are not detected at the same locations, rather than because the feature descriptors change and fail to match.
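
    As a minimal illustration of the shadow geometry referenced above (not the full relationship, which also handles non-planar receiving surfaces), the sketch below intersects the shadow ray from a 3D point with a ground plane, given the sun direction implied by the date, time, and geo-location.

      import numpy as np

      def shadow_on_ground(p, sun_dir):
          """Intersect the shadow ray leaving 3D point p (in the direction away
          from the sun) with the ground plane z = 0. sun_dir is a unit vector
          pointing toward the sun with sun_dir[2] > 0 (sun above the horizon)."""
          p = np.asarray(p, dtype=float)
          s = np.asarray(sun_dir, dtype=float)
          t = p[2] / s[2]          # travel along -s until z reaches 0
          return p - t * s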
MIT
  • We address a fundamental property of the temporal structure of images and ask: are image temporal statistics symmetric in time? Equivalently, can we see the arrow of time, i.e., distinguish a video playing forward from one playing backward? We probe this problem in three different ways, with a certain degree of success: a bag-of-flow-words model for local motion patterns, a motion causality model, and an auto-regressive causality model.
  • We present a new compact image pyramid representation, the Riesz pyramid, that can be used for real-time phase-based motion magnification. Our new representation is less overcomplete than even the smallest two-orientation, octave-bandwidth complex steerable pyramid, and can be implemented using compact, efficient linear filters in the spatial domain; an FFT-based sketch of the underlying Riesz computation appears after this list. Motion-magnified videos produced with this new representation are of comparable quality to those produced with the complex steerable pyramid.
  • Sound hitting an object causes tiny surface vibrations. We show how, using only high-speed video of the object, we can extract those minute vibrations and partially recover the sound that produced them, allowing us to turn everyday objects (a glass of water, a potted plant, a box of tissues, or a bag of chips) into visual microphones. We also explore how to leverage the rolling shutter in regular consumer cameras to recover audio from standard frame-rate videos, and use the spatial resolution of our method to visualize how sound-related vibrations vary over an object's surface, which we can use to recover the vibration modes of an object. A rough single-signal sketch of the idea also appears after this list.
  • Sudden temporal depth changes, such as cuts that are introduced by video edits, can significantly degrade the quality of stereoscopic video. This is because eye vergence has to constantly adapt to new disparities despite conflicting accommodation requirements. Such rapid disparity changes may lead to confusion and reduced understanding of the scene. To better understand this limitation of the human visual system, we conducted a series of eye-tracking experiments. The data obtained allowed us to derive and evaluate a model describing adaptation of vergence to disparity changes on a stereoscopic display.
  • We provide a solution that takes a stereoscopic video as input and converts it to multi-view video streams. The method combines phase-based video magnification and inter-perspective antialiasing into a single filtering process. The whole algorithm is simple and can be efficiently implemented on current GPUs to yield near real-time performance. Our method is robust and works well for challenging video scenes with defocus blur, motion blur, transparent materials, and specularities. We show that our results are superior to those of state-of-the-art depth-based rendering methods.
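
    For reference, the monogenic (Riesz) quantities that the Riesz pyramid is built around can be computed for a single bandpassed image as sketched below. This FFT-based version is only illustrative: the Riesz pyramid itself uses compact spatial-domain filter approximations, which is what makes it real-time.

      import numpy as np

      def riesz_components(band):
          """Frequency-domain Riesz transform of one bandpassed image. Together
          with the input, the two outputs form the monogenic signal, from which
          local amplitude, phase, and orientation are read off."""
          h, w = band.shape
          fy = np.fft.fftfreq(h)[:, None]
          fx = np.fft.fftfreq(w)[None, :]
          norm = np.sqrt(fx**2 + fy**2)
          norm[0, 0] = 1.0                                  # avoid dividing by zero at DC
          F = np.fft.fft2(band)
          r1 = np.real(np.fft.ifft2(-1j * fx / norm * F))
          r2 = np.real(np.fft.ifft2(-1j * fy / norm * F))
          return r1, r2

      def local_amplitude_phase_orientation(band):
          r1, r2 = riesz_components(band)
          amplitude = np.sqrt(band**2 + r1**2 + r2**2)
          phase = np.arctan2(np.sqrt(r1**2 + r2**2), band)  # local phase
          orientation = np.arctan2(r2, r1)                  # dominant local orientation
          return amplitude, phase, orientation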
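
    The single-signal sketch of the visual-microphone idea mentioned above is a very rough stand-in: the actual method works with local phases of a complex steerable pyramid, whereas the OpenCV phase-correlation call here recovers only one global shift per frame.

      import numpy as np
      import cv2

      def crude_sound_from_video(frames):
          """Track the tiny global shift of an object region in each frame via
          phase correlation against the first frame, and use the horizontal
          shift as a 1-D sound signal sampled at the video frame rate.
          frames: sequence of same-size grayscale float32 images from the high-speed video."""
          ref = np.float32(frames[0])
          shifts = []
          for f in frames[1:]:
              (dx, dy), _ = cv2.phaseCorrelate(ref, np.float32(f))
              shifts.append(dx)
          s = np.asarray(shifts)
          return s - s.mean()          # remove the DC offset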
Cornell University
  • We created a new approach to understanding objects in real-world scenes by grounding these scenes in the real world and solving for precise camera viewpoints. Through this work, we created a new dataset of photos of Times Square from many different viewpoints and times, all accurately georegistered with precise camera locations and orientation, and with scene annotations derived from public GIS information (street and sidewalk locations, terrain models, etc.). We used this dataset to reason about objects in real-world settings, and have made this data freely available.
  • We developed a new statistical approach for reasoning about scene appearance and illumination from unstructured photo collections of outdoor scenes. Our method uses sun/sky models of outdoor illumination, developed in the computer graphics community, to derive illumination statistics for a particular place on Earth, and couple these with pixel statistics derived from imagery to derive information about the albedo and local visibility at each scene point.
  • We show that we can take large, unstructured photo collections of a temporally varying scene, such as Times Square, and compute the chronology of that scene -- i.e., what existed where, and for how long -- from the images together with the noisy time stamps on each photo. We further use this technique to show how to derive a probability distribution over the time when a particular photo was taken.
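
    As a crude stand-in for the interval-inference step (the actual method reasons jointly about 3D structure, visibility, and time), one can robustly trim the noisy capture times of the photos that observe an object's 3D points; the function and trimming fraction below are illustrative only.

      import numpy as np

      def object_time_span(timestamps, trim=0.05):
          """Estimate the interval during which a reconstructed object existed
          from the noisy capture times of the photos that observe its 3D points,
          by trimming the extreme (most likely erroneous) timestamps."""
          t = np.asarray(timestamps, dtype=float)
          return np.quantile(t, trim), np.quantile(t, 1.0 - trim)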
Harvard University
  • We created a new approach to understanding objects in real-world scenes by grounding these scenes in the real world and solving for precise camera viewpoints. Through this work, we created a new dataset of photos of Times Square from many different viewpoints and times, all accurately georegistered with precise camera locations and orientation, and with scene annotations derived from public GIS information (street and sidewalk locations, terrain models, etc.). We used this dataset to reason about objects in real-world settings, and have made this data freely available.
  • We developed a new statistical approach for reasoning about scene appearance and illumination from unstructured photo collections of outdoor scenes. Our method uses sun/sky models of outdoor illumination, developed in the computer graphics community, to derive illumination statistics for a particular place on Earth, and couple these with pixel statistics derived from imagery to derive information about the albedo and local visibility at each scene point.
  • We show that we can take large, unstructured photo collections of a temporally varying scene, such as Times Square, and compute the chronology of that scene -- i.e., what existed where, and for how long -- from the images together with the noisy time stamps on each photo. We further use this technique to show how to derive a probability distribution over the time when a particular photo was taken.

2012-2013


Washington University
  • We have experimentally shown that using the solar illumination direction as a photometric stereo light source is possible, but it requires good estimation of the non-linear camera response function, and it requires images captured over a time scale of months in order to get a wide enough variation in illumination direction that the surface normal estimation problem is well conditioned.
  • We derived a new constraint, the “epi-solar” constraint, that makes use of geo-calibrated cameras and image sequences in which each image has an accurate time stamp. This scenario allows one to search along a line for correspondences between shadow edges and the points in the scene that cast them, and we give an algorithm that links many such constraints into a complete depth map; a small sketch of the epi-solar line computation follows this list.
  • As an initial feasibility study, we have shown that large-scale databases of already-deployed, geo-located webcams can be re-purposed to characterize movement of people in public spaces in ways that relate to public health and exercise.
  • We demonstrated the ability to create approximations of satellite cloud maps based entirely on data from thousands of ground-level webcams.
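
    The epi-solar line sketch referenced above, assuming a geo-calibrated camera with intrinsics K and world-to-camera rotation R; the sun direction for the image's time stamp would come from a standard solar-position model.

      import numpy as np

      def epi_solar_line(K, R, sun_dir_world, caster_pixel):
          """Homogeneous image line on which the shadow of the point seen at
          caster_pixel must lie. K: 3x3 intrinsics, R: world-to-camera rotation,
          sun_dir_world: unit sun direction for the image's time stamp. The
          caster, its shadow, and the sun direction are collinear in 3D, so the
          shadow projects onto the line joining the caster's pixel to the sun's
          vanishing point."""
          v_sun = K @ (R @ np.asarray(sun_dir_world, dtype=float))   # sun vanishing point
          p = np.array([caster_pixel[0], caster_pixel[1], 1.0])
          line = np.cross(p, v_sun)                                  # line l with l . x = 0
          return line / np.linalg.norm(line[:2])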
MIT
  • Our new SIGGRAPH paper proposes a new pipeline for editing small motions in video sequences, based on an analysis of motion in complex-valued image pyramids. This new framework is still “Eulerian”, meaning it does not require motion estimation (and is thus efficient and robust), but uses a more explicit representation of the motion than in our previous Eulerian Video Magnification work. We demonstrated how this new pipeline achieves much better motion magnification results both in theory and in practice. It also supports new applications such as motion attenuation, to efficiently remove distracting motion such as heat or atmospheric turbulence. A single-band toy of the phase-amplification step is sketched after this list.
  • We found that this new pipeline can also successfully create views of differing visual disparities, given an input stereo video sequence. This is an important requirement for 3D television displays: given an input stereo sequence, it is required that images of multiple different disparities be generated on-the-fly, in order to send the appropriate images to the viewers of the 3D display.
  • We found it was possible to synthesize images of different times of day, given a single input photograph, and a dataset of approximately 1000 timelapse sequences.
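
    The single-band toy referenced in the first bullet is sketched below, assuming a (T, H, W) grayscale float video; the real pipeline uses a complex steerable pyramid over many scales and orientations and a temporal bandpass filter rather than a reference frame, so this only magnifies horizontal motion and is not the published method.

      import numpy as np

      def toy_phase_magnification(frames, alpha=15.0):
          """Amplify small horizontal motions in a (T, H, W) float video.
          A single complex band is built by keeping only positive horizontal
          frequencies; each frame's phase change relative to frame 0 is
          multiplied by alpha and the modified band is added back."""
          W = frames.shape[2]
          mask = (np.fft.fftfreq(W) > 0).astype(float) * 2.0
          spectrum = np.fft.fft(frames, axis=2) * mask[None, None, :]
          band = np.fft.ifft(spectrum, axis=2)                 # complex, analytic along x
          dphase = np.angle(band * np.conj(band[0:1]))         # phase change vs. frame 0
          magnified = band * np.exp(1j * alpha * dphase)
          return frames + np.real(magnified - band)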
Cornell University
  • Our new SIGGRAPH paper proposes a new pipeline for editing small motions in video sequences, based on an analysis of motion in complex-valued image pyramids. This new framework is still “Eulerian”, meaning it does not require motion estimation (and is thus efficient and robust), but uses a more explicit representation of the motion than in our previous Eulerian Video Magnification work. We demonstrated how this new pipeline achieves much better motion magnification results both in theory and in practice. It also supports new applications such as motion attenuation, to efficiently remove distracting motion such as heat or atmospheric turbulence.
  • We found that this new pipeline can also successfully create views of differing visual disparities, given an input stereo video sequence. This is an important requirement for 3D television displays: given an input stereo sequence, it is required that images of multiple different disparities be generated on-the-fly, in order to send the appropriate images to the viewers of the 3D display.
  • We found it was possible to synthesize images of different times of day, given a single input photograph, and a dataset of approximately 1000 timelapse sequences.
Harvard University
  • Our new SIGGRAPH paper proposes a new pipeline for editing small motions in video sequences, based on an analysis of motion in complex-valued image pyramids. This new framework is still “Eulerian”, meaning it does not require motion estimation (and is thus efficient and robust), but uses a more explicit representation of the motion than in our previous Eulerian Video Magnification work. We demonstrated how this new pipeline achieves much better motion magnification results both in theory and in practice. It also supports new applications such as motion attenuation, to efficiently remove distracting motion such as heat or atmospheric turbulence.
  • We found that this new pipeline can also successfully create views of differing visual disparities, given an input stereo video sequence. This is an important requirement for 3D television displays: given an input stereo sequence, it is required that images of multiple different disparities be generated on-the-fly, in order to send the appropriate images to the viewers of the 3D display.
  • We found it was possible to synthesize images of different times of day, given a single input photograph, and a dataset of approximately 1000 timelapse sequences.

2011-2012


Washington University
  • Our ECCV paper develops an efficient method to factor long-term, outdoor image data into geometric, illumination, and imaging-device parameters, in ways that return geo-referenced surface normals and 3D models accurate enough to support quantitative evaluation against ground-truth scene shape. Key findings from this paper include the fact that, for a large class of webcam and other uncontrolled imagery, it is necessary to explicitly model the non-linear response function of the camera, and that imagery from very long-term time-lapse (i.e., months, not just days) is necessary to provide a sufficient span of lighting directions to infer scene geometry.
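
    To show where that linearization step sits in the pipeline, here is a deliberately simplified sketch; the paper estimates the response function from the data itself rather than assuming a fixed gamma as done here.

      import numpy as np

      def linearize_intensities(I, gamma=2.2):
          """Undo an assumed gamma-style camera response so pixel values are
          roughly proportional to scene radiance before the photometric solve.
          I: image values scaled to [0, 1]."""
          return np.clip(I, 0.0, 1.0) ** gamma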
MIT
  • The findings of the SIGGRAPH paper: a simple, pyramid-based signal-processing approach allows for (a) visualization of the human pulse, as well as (b) exaggeration and visualization of small motion changes in a video sequence; a toy version of the temporal filtering step is sketched after this list. We are further developing this tool, working on a real-time iPad or tablet implementation. We hope this will provide a "magic window" on the world: a real-time tool for visualizing small color and motion changes.
  • The findings of the CVPR paper: We've developed a practical system to observe potentially wavelength-scale changes in surfaces, "in the wild". This type of measurement sensitivity is common in laboratory settings, on an optical table, for example. But this approach allows detection of very small changes in various real-world settings, as well.
  • The findings of the second CVPR paper: The core idea is to use a dense multi-camera array to construct a novel, dense 3D volumetric representation of the 3D space where each voxel holds an estimated intensity value and a confidence measure of this value. The problem of 3D structure and 3D motion estimation of a scene is thus reduced to a nonrigid registration of two volumes. Registering two dense 3D scalar volumes does not require recovering the 3D structure of the scene as a pre-processing step, nor does it require explicit reasoning about occlusions.
  • The latent factor model of human travel revealed interpretable properties, including travel distance, desirability of destinations, and affinity between locations.
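
    The toy temporal-filtering sketch referenced in the first bullet is shown below with SciPy's Butterworth filter. The published method operates on a full Gaussian/Laplacian pyramid and chooses bands and amplification factors carefully, so this is only the skeleton of the idea; the function name and default parameters are illustrative.

      import numpy as np
      from scipy.signal import butter, filtfilt

      def linear_eulerian_magnification(frames, fps, lo=0.8, hi=3.0, alpha=50.0):
          """Temporally band-pass every pixel of a spatially low-passed video
          around the pulse band (lo..hi in Hz) and add the amplified band back.
          frames: (T, H, W) float array, e.g. a coarse Gaussian-pyramid level."""
          b, a = butter(2, [lo / (fps / 2.0), hi / (fps / 2.0)], btype="band")
          band = filtfilt(b, a, frames, axis=0)     # per-pixel temporal filtering
          return frames + alpha * band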
Cornell University
  • Our work on camera pose estimation to be presented at ECCV 2012 shows that it is possible to precisely localize outdoor images in the wild by comparing them to a large database of structure-from-motion 3D models. This allows for determining not only position, but also orientation, focal length, and radial distortion parameters, and thus can automatically associate pixels in the image with rays in a geo-referenced coordinate system. The technical innovation in this problem involves efficiently matching a photograph to hundreds of millions of 3D points in the database. We are using this technique to derive simple ways to determine time of day from Internet photos.
Harvard University
  • The findings of the video color transfer project: in addition to the issues involved in transferring color between still images, temporal coherence needs to be addressed. Using novel interpolation and filtering schemes, we smooth out spatio-temporal artifacts arising during the color transfer process, thereby enforcing temporal coherence and increasing the overall transfer quality.
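
    A minimal sketch of one way to enforce such temporal coherence: exponentially smooth a per-frame affine color transform (color -> A @ color + b) over time. The affine form and the smoothing factor are illustrative assumptions, not the actual interpolation and filtering schemes used in the project.

      import numpy as np

      def smooth_color_transforms(A_list, b_list, beta=0.9):
          """Exponentially smooth a per-frame affine color transform
          (color -> A @ color + b) over time so transferred colors do not
          flicker from frame to frame."""
          A_smooth = [np.asarray(A_list[0], dtype=float)]
          b_smooth = [np.asarray(b_list[0], dtype=float)]
          for A, b in zip(A_list[1:], b_list[1:]):
              A_smooth.append(beta * A_smooth[-1] + (1.0 - beta) * np.asarray(A, dtype=float))
              b_smooth.append(beta * b_smooth[-1] + (1.0 - beta) * np.asarray(b, dtype=float))
          return A_smooth, b_smooth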

Supported by the National Science Foundation