Multi-Sensory Perceptive Systems: Human and Machine Processing of Multi-Modal Data

Presentation Abstracts

Saturday
December 8th

Morning Session (7:30 AM - 10:30 AM)

Biological Studies of Multi-modal Perception

Salvador Soto-Faraco University of Barcelona	TBA
Jochen Triesch University of California San Diego	Democratic integration: self-organized integration of adaptive sensory cues The integration of information provided by different sensors, cues, and modalities into useful percepts is among the most fundamental problems that biological organisms and robots have to face. Since all cues are generally noisy and often prone to failure, their reliability for the task at hand has to be constantly re-estimated. I will present a new architecture for adaptively integrating different cues in a self-organized manner. In Democratic Integration, different cues agree on a result, while at the same time each cue adapts to the result being agreed on. This leads to automatic suppression and re-calibration of discordant cues. I will first present a face tracking system implementing the idea. Experiments demonstrate benefits of the scheme, in particular robustness to sudden changes in the environment and the potential usefulness of cues for which no a priori knowledge about the task had been formulated. Then I will describe a psychophysical experiment aimed at testing the Democratic Integration hypothesis for visual cue integration in humans. The results suggest that observers can adapt their visual cue combination strategies on a very fast time-scale (about one second).
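The agree-then-adapt loop described in this abstract can be illustrated with a minimal sketch. It is not the speaker's implementation: the 1-D saliency maps, the quality measure (response at the agreed position minus the cue's mean response), the adaptation rate, and the function name `democratic_step` are all assumptions made for illustration, and the re-calibration of each cue's own prototype is left out.

```python
def democratic_step(saliency_maps, reliabilities, rate=0.2):
    """One fusion-and-adaptation step over 1-D saliency maps (one list per cue)."""
    n_positions = len(saliency_maps[0])
    # Fuse: reliability-weighted sum of the cue maps.
    fused = [sum(r * m[x] for r, m in zip(reliabilities, saliency_maps))
             for x in range(n_positions)]
    # Agree: the shared result is the position with the highest fused saliency.
    estimate = max(range(n_positions), key=fused.__getitem__)
    # Quality (assumed measure): how far each cue's response at the agreed
    # position exceeds that cue's mean response, clipped at zero.
    raw = [max(0.0, m[estimate] - sum(m) / n_positions) for m in saliency_maps]
    total = sum(raw)
    if total > 0:
        quality = [q / total for q in raw]
    else:
        quality = [1.0 / len(raw)] * len(raw)
    # Adapt: each reliability relaxes toward its normalized quality, so cues
    # that disagree with the result are gradually suppressed.
    new_reliabilities = [r + rate * (q - r) for r, q in zip(reliabilities, quality)]
    return estimate, new_reliabilities


# Two cues peak at position 3; a third, discordant cue peaks at position 0.
maps = [[0.0, 0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.2, 0.9, 0.0],
        [1.0, 0.0, 0.0, 0.0, 0.0]]
rel = [1.0 / 3] * 3
for _ in range(20):
    estimate, rel = democratic_step(maps, rel)
```

In this toy run the two concordant cues keep the estimate at position 3 while the reliability of the discordant cue decays toward zero; the reliabilities continue to sum to one because the qualities are normalized.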
Hans Colonius Universität Oldenburg	Spatio-temporal rules of visual-auditory interaction in saccadic eye movements Subjects are asked to make a saccadic eye movement to a suddenly appearing visual target while ignoring an auditory (distractor) stimulus presented in spatio-temporal contiguity (focused attention task). Compared to the unimodal visual condition, saccadic onset time in the bimodal condition is reduced depending on the specific spatial and temporal relation between visual and auditory stimulus. The data from various experiments are described by a two-stage model of multisensory integration which is consistent with recent neurophysiological findings on the control of eye movements.
Robert Jacobs University of Rochester	Learning to see in three dimensions Why is seeing the world in three dimensions so easy? We believe that this ease is due to the fact that the visual world is highly redundant; there are many cues to perceptual properties such as depth and shape. However, combining information from multiple cues in an effective manner is non-trivial. We argue that people must learn their cue combination strategies on the basis of experience. In particular, we address the question of whether or not observers can adapt their visual cue combination strategies on the basis of consistencies between visual and haptic (touch) percepts. Berkeley (1709), Piaget (1952), and many others speculated that people calibrate their interpretations of visual cues on the basis of their motor interactions with objects in the world. Despite the intuitive appeal of this hypothesis, it has never been adequately tested. Using a novel virtual reality environment, we have conducted three experiments whose results suggest that observers adapt their visual cue combination strategies based on correlations between visual and haptic percepts.

Applications using Audio-Video Fusion

Jim Rehg Georgia Institute of Technology	Analysis of complex audio-visual events using spatially distributed sensors
Michael Harville Hewlett-Packard Laboratories	Multi-modal perceptual interfaces in the social media project In our increasingly mobile and active society, people often find themselves physically separated from their friends and families, and with fewer large blocks of time to arrange and participate in social activities with them. The proliferation of technology is in many ways responsible for this, but it may also help provide some solutions. In the Social Media project at HP Labs, we are building an infrastructure that aims to make virtual socialization among remotely located people a more seamlessly organized and truly enjoyable experience. Audio and video media play many roles in this project: 1) media such as music and television can provide a context around which to socialize; 2) audio and video allow for rich, expressive forms of communication; and 3) audio and visual machine perception enable the creation of interfaces that demand little effort of the participants. This last factor is critical in making the personal interactions, rather than the intermediating technology, the focus of the virtual socialization experience. I will first briefly describe some of the application scenarios and technology components that are currently the focus of Social Media. I will then discuss some of the more novel visual and audio perception components that we hope to integrate into the project. This includes various types of person detection, recognition, tracking, and activity analysis based on stereo vision, as well as some work on low-power and distributed approaches to classic audio perception problems. Finally, I will discuss some of the multi-modal perception integration scenarios we are considering, and some of the issues we are encountering. 
An important theme is that some of the suite of devices on which Social Media is implemented (such as video displays and desktop computers) will be in fixed environment locations, while others (such as PDAs and cellphones) will be mobile. We will want these devices to share with each other some of the various sorts of sensory information they acquire, and we must decide what levels of representation are best to share, how to integrate the various spatial coordinate systems involved, how best to allow one source of perception to guide another, on what devices to do the computation, and so forth.
Trevor Darrell MIT Artificial Intelligence Laboratory	Integrated audio/video sensor arrays

Afternoon Session (4:00 PM - 7:00 PM)

Statistical Methods of Fusion

Andrew Blake Microsoft Research	Integrated tracking with vision and sound Stereo sound and vision are complementary modalities in that sound is good for initialisation (where vision is expensive) whereas vision is good for localisation (where sound is less precise). Using generative probabilistic models and particle filtering, we show that stereo sound and vision can indeed be fused effectively, to make a system more capable than with either modality on its own.
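As a rough illustration of fusing a coarse audio cue with a precise visual cue by particle filtering, the sketch below runs a generic bootstrap particle filter over a 1-D source position. It is not the system described in the abstract: the Gaussian observation models, the noise levels, the random-walk motion model, and the assumption that the two observations are conditionally independent given the position are all illustrative choices.

```python
import math
import random


def gaussian(x, mean, std):
    """Gaussian density, used here as an assumed observation likelihood."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def particle_filter_step(particles, audio_obs, vision_obs, rng,
                         audio_std=0.5, vision_std=0.05, motion_std=0.05):
    """Predict, weight by both modalities, and resample."""
    # Predict: diffuse particles with a random-walk motion model.
    moved = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: product of the audio and vision likelihoods
    # (conditional independence is an assumption of this sketch).
    weights = [gaussian(audio_obs, p, audio_std) * gaussian(vision_obs, p, vision_std)
               for p in moved]
    if sum(weights) == 0.0:
        return moved  # no particle explains the data; skip resampling
    # Resample in proportion to the fused weights.
    return rng.choices(moved, weights=weights, k=len(moved))


rng = random.Random(0)
# Initialise from the coarse audio estimate, then let vision sharpen it.
particles = [rng.gauss(1.2, 0.5) for _ in range(500)]
for _ in range(10):
    particles = particle_filter_step(particles, audio_obs=1.2, vision_obs=1.0, rng=rng)
position_estimate = sum(particles) / len(particles)
```

The broad audio likelihood spreads the initial particles over the plausible region cheaply, and the narrow visual likelihood then concentrates them near the visually observed position, which mirrors the division of labour the abstract describes.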
John Hershey University of California San Diego	Audio-visual sound separation using hidden Markov models
Matthew Beal Gatsby Computational Neuroscience Unit	Bayesian combination of audio and video modalities We present a self-calibrating algorithm for audio-visual tracking using two microphones and a camera. The algorithm uses a parametrized statistical model which combines simple models of video and audio. Using unobserved variables, the model describes the process that generates the observed data. Hence, it is able to capture and exploit the statistical structure of the audio and video data, as well as their mutual dependencies. The model parameters are estimated by the EM algorithm; object templates are learned and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location using the model. Successful performance is demonstrated on real multimedia clips.

Multi-Modal Perception Models

John Jeka University of Maryland	Properties of multisensory fusion for human spatial orientation The advantage of multisensory information is often expressed in terms of signal enhancement or resolution. Multisensory signals are more easily detected than information from a single sensory source. However, when placed into the context of perception linked to the control of body movement, multisensory fusion has the additional advantage of resolving ambiguities between movements of different body components. Human self-orientation requires multisensory fusion not for signal enhancement, but for a collective characterization of multi-linked body dynamics. This characterization entails two basic processes: estimation and control. Dynamic characteristics of body sway identified with time series models have led to measures that distinguish estimation from control. Using these measures to then develop models that hypothesize underlying mechanisms has resulted in two important findings. First, most of the variability observed in body sway can be linked to the process of estimating the center of mass from multisensory information. Standard control theory algorithms cannot account for how multisensory information is fused for center of mass estimation without a process we refer to as
Javier Movellan University of California San Diego	Information integration in humans and machines We present ongoing work at UCSD's Machine Perception Laboratory, trying to relate the psychophysics of human perception and the development of perceptual machines that use multimodal information. This research program proceeds by finding constraints under which psychophysical regularities are optimal, analyzing processing models compatible with such constraints, and testing whether machine perception systems that adhere to such constraints perform well. An illustration of the approach will be offered for the problem of audio-visual speech perception and a psychophysical regularity known as the Morton-Massaro law.
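The Morton-Massaro law is usually stated as follows: the probability of each response is proportional to the product of the support that each information source gives to that response. The sketch below shows that rule for two sources; the function name and the support values are hypothetical and are not taken from the talk.

```python
def morton_massaro(audio_support, visual_support):
    """Response probabilities as normalized products of per-source support."""
    products = [a * v for a, v in zip(audio_support, visual_support)]
    total = sum(products)
    return [p / total for p in products]


# Hypothetical support for two candidate syllables from each modality.
probs = morton_massaro([0.8, 0.2], [0.6, 0.4])
```

Under this rule the odds between two responses are the product of the odds given by each source alone, which is why the law is often read as a signature of sources contributing independently.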
Thomas Anastasio University of Illinois, Urbana-Champaign	A probabilistic framework for understanding multisensory enhancement in the superior colliculus The superior colliculus is organized topographically as a neural map. The deep layers of the colliculus detect and localize targets in the environment by integrating input from multiple sensory systems. Some deep colliculus neurons receive input of only one sensory modality (unimodal) while others receive input of multiple modalities. Multimodal deep SC neurons exhibit multisensory enhancement, in which the response to input of one modality is augmented by input of another modality. Multisensory enhancement is magnitude dependent in that combinations of smaller single-modality responses produce larger amounts of enhancement, a property known as inverse effectiveness. These findings are consistent with the hypothesis that deep colliculus neurons use sensory input to compute the probability that a target has appeared at their corresponding location in the environment. Multisensory enhancement and inverse effectiveness can be simulated using a model in which sensory inputs are random variables and target probability is computed using Bayes' Rule. Informational analysis of the model indicates that input of another modality can indeed increase the amount of target information received by a multimodal neuron, but only if input of the initial modality is ambiguous. Unimodal deep colliculus neurons may receive unambiguous input of one modality and have no need of input of another modality.
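A small numerical sketch can make the Bayes' Rule account concrete. It assumes, purely for illustration, that each sensory input is a Poisson-distributed spike count whose mean rises when a target is present, and that inputs from different modalities are conditionally independent; the prior, the two mean rates, and the function names are hypothetical values chosen for the example, not parameters from the talk.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing k spikes under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def target_posterior(counts, prior=0.1, spont=5.0, driven=8.0):
    """P(target | input counts), one count per modality, via Bayes' Rule."""
    with_target = prior
    without_target = 1.0 - prior
    for k in counts:
        with_target *= poisson_pmf(k, driven)    # input driven by a target
        without_target *= poisson_pmf(k, spont)  # spontaneous activity only
    return with_target / (with_target + without_target)


def enhancement(count):
    """Percent gain of the two-input posterior over the one-input posterior."""
    unimodal = target_posterior([count])
    bimodal = target_posterior([count, count])
    return 100.0 * (bimodal - unimodal) / unimodal


# Weak, moderate, and strong single-modality inputs.
gains = [enhancement(k) for k in (8, 14, 20)]
```

With these assumed parameters the percent gain from a second modality shrinks as the single-modality input grows: a strong input already makes the target nearly certain, so a second input adds little, whereas a weak, ambiguous input benefits most.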
Shihab Shamma University of Maryland	TBA

gregory@ai.mit.edu

Copyright (c) 2001, John W. Fisher III. All rights reserved.