|
|
Morning Session     (7:30 AM - 10:30 AM)
Biological Studies of Multi-modal Perception
Salvador Soto-Faraco
University of Barcelona |
TBA
|
Jochen Triesch
University of California San Diego |
Democratic integration: self-organized integration of adaptive sensory cues
The integration of information provided by different
sensors, cues, and modalities into useful percepts is among the most
fundamental problems that biological organisms and robots have to
face. Since all cues are generally noisy and often prone to failure,
their reliability for the task at hand has to be constantly
re-estimated. I will present a new architecture for adaptively
integrating different cues in a self-organized manner. In Democratic
Integration different cues agree on a result, while at the same time
each cue adapts to the result being agreed on. This leads to automatic
suppression and re-calibration of discordant cues. I will first
present a face tracking system implementing the idea. Experiments
demonstrate benefits of the scheme, in particular robustness to sudden
changes in the environment and the potential usefulness of cues for
which no a priori knowledge about the task had been formulated. Then I
will describe a psychophysical experiment aimed at testing the
Democratic Integration hypothesis for visual cue integration in
humans. The results suggest that observers can adapt their visual cue
combination strategies on a very fast time-scale (about one second).
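The agree-then-adapt loop described in this abstract can be illustrated with a minimal sketch. It is not the speaker's implementation: the one-dimensional saliency maps, the quality measure, the time constant `tau`, and the names `bump` and `democratic_step` are all assumptions made for illustration. Only the reliability weights adapt here; the re-calibration of each cue's internal model is left out.

```python
import numpy as np

def bump(center, n=50, width=3.0):
    """Hypothetical cue output: a Gaussian saliency bump over n positions."""
    x = np.arange(n)
    return np.exp(-0.5 * ((x - center) / width) ** 2)

def democratic_step(saliency, reliab, tau=5.0):
    """One fusion step.

    saliency: array (n_cues, n_positions), one saliency map per cue.
    reliab:   array (n_cues,), reliability weights summing to 1.
    """
    combined = reliab @ saliency          # reliability-weighted vote
    x_hat = int(np.argmax(combined))      # the result the cues agree on
    # Assumed quality measure: how far each cue's response at x_hat
    # exceeds that cue's own mean response (clipped at zero).
    quality = np.maximum(saliency[:, x_hat] - saliency.mean(axis=1), 0.0)
    total = quality.sum()
    if total > 0:
        quality = quality / total
    else:
        quality = np.full_like(reliab, 1.0 / len(reliab))
    # Reliabilities relax toward the normalized qualities.
    reliab = reliab + (quality - reliab) / tau
    return x_hat, reliab

# Two cues agree on position 20; a third, discordant cue points to 40.
saliency = np.stack([bump(20), bump(20), bump(40)])
reliab = np.full(3, 1.0 / 3.0)
for _ in range(30):
    x_hat, reliab = democratic_step(saliency, reliab)
print(x_hat, np.round(reliab, 3))
```

In this toy setting the discordant cue's weight decays toward zero while the weights still sum to one, which is the suppression effect the abstract refers to.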
|
Hans Colonius
Universität Oldenburg |
Spatio-temporal rules of visual-auditory interaction in saccadic eye
movements
Subjects are asked to make a saccadic eye movement to a
suddenly appearing visual target while ignoring an
auditory (distractor) stimulus presented in
spatio-temporal contiguity (focused attention
task). Compared to the unimodal visual condition,
saccadic onset time in the bimodal condition is reduced
depending on the specific spatial and temporal relation
between visual and auditory stimulus. The data from
various experiments are described by a two-stage model
of multisensory integration which is consistent with
recent neurophysiological findings on the control of eye
movements.
|
Robert Jacobs
University of Rochester |
Learning to see in three dimensions
Why is
seeing the world in three dimensions so easy? We believe that this
ease is due to the fact that the visual world is highly redundant;
there are many cues to perceptual properties such as depth and shape.
However, combining information from multiple cues in an effective
manner is non-trivial. We argue that people must learn their cue
combination strategies on the basis of experience. In particular, we
address the question of whether or not observers can adapt their
visual cue combination strategies on the basis of consistencies
between visual and haptic (touch) percepts. Berkeley (1709), Piaget
(1952), and many others speculated that people calibrate their
interpretations of visual cues on the basis of their motor
interactions with objects in the world. Despite the intuitive appeal
of this hypothesis, it has never been adequately tested. Using a
novel virtual reality environment, we have conducted three experiments
whose results suggest that observers adapt their visual cue
combination strategies based on correlations between visual and haptic
percepts.
|
Applications using Audio-Video Fusion
Jim Rehg
Georgia Institute of Technology |
Analysis of complex audio-visual events using spatially
distributed sensors
|
Michael Harville
Hewlett-Packard Laboratories |
Multi-modal perceptual interfaces in the social media project
In our increasingly mobile and active society, people
often find themselves physically separated from their friends and
families, and with fewer large blocks of time to arrange and
participate in social activities with them. The proliferation of
technology is in many ways responsible for this, but it may also help
provide some solutions. In the Social Media project at HP Labs, we are
building an infrastructure that aims to make virtual socialization
among remotely located people a more seamlessly organized and truly
enjoyable experience. Audio and video media play many roles in this
project: 1) media such as music and television can provide a context
around which to socialize; 2) audio and video allow for rich,
expressive forms of communication; and 3) audio and visual machine
perception enable the creation of interfaces that demand little effort
of the participants. This last factor is critical in making the
personal interactions, rather than the intermediating technology, the
focus of the virtual socialization experience.
I will first briefly describe some of the application scenarios and
technology components that are currently the focus of Social Media. I will
then discuss some of the more novel visual and audio perception components
that we hope to integrate into the project. This includes various types of
person detection, recognition, tracking, and activity analysis based on
stereo vision, as well as some work on low-power and distributed approaches
to classic audio perception problems. Finally, I will discuss some of the
multi-modal perception integration scenarios we are considering, and some of
the issues we are encountering. An important theme is that some of the
devices on which Social Media is implemented (such as video displays and
desktop computers) will be in fixed environment locations, while others
(such as PDAs and cellphones) will be mobile. We will want these devices to
share with each other some of the various sorts of sensory information they
acquire, and we must decide what levels of representation are best to share,
how to integrate the various spatial coordinate systems involved, how best
to allow one source of perception to guide another, on what devices to do
the computation, and so forth.
|
Trevor Darrell
MIT Artificial Intelligence Laboratory |
Integrated audio/video sensor arrays
|
Afternoon Session     (4:00 PM - 7:00 PM)
Statistical Methods of Fusion
Andrew Blake
Microsoft Research |
Integrated tracking with vision and sound
Stereo sound and vision are complementary modalities in
that sound is good for initialisation (where vision is
expensive) whereas vision is good for localisation
(where sound is less precise). Using generative
probabilistic models and particle filtering, we show
that stereo sound and vision can indeed be fused
effectively, yielding a system more capable than one using
either modality on its own.
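As a rough illustration of fusing the two modalities in a particle filter, the sketch below tracks a single azimuth angle. It is a sketch under stated assumptions, not the model from the talk: the random-walk dynamics, the Gaussian likelihoods, the noise levels, and the function names are hypothetical. The broad audio likelihood narrows the search cheaply, and the sharp vision likelihood refines the estimate.

```python
import numpy as np

def gaussian_lik(obs, particles, sigma):
    """Unnormalized Gaussian likelihood of an observation for each particle."""
    return np.exp(-0.5 * ((obs - particles) / sigma) ** 2)

def particle_filter_step(particles, audio_obs, vision_obs, rng,
                         motion_sigma=1.0, audio_sigma=10.0, vision_sigma=1.0):
    """One predict/weight/resample cycle over azimuth hypotheses (degrees)."""
    # Predict: assumed random-walk motion model.
    particles = particles + rng.normal(0.0, motion_sigma, size=particles.shape)
    # Weight: product of the two likelihoods, which assumes the audio and
    # vision observations are conditionally independent given the state.
    weights = (gaussian_lik(audio_obs, particles, audio_sigma)
               * gaussian_lik(vision_obs, particles, vision_sigma))
    weights = weights + 1e-300            # guard against an all-zero vector
    weights = weights / weights.sum()
    # Resample in proportion to the weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

rng = np.random.default_rng(0)
true_azimuth = 30.0
particles = rng.uniform(-90.0, 90.0, size=2000)   # no prior knowledge
for _ in range(10):
    audio_obs = true_azimuth + rng.normal(0.0, 10.0)   # coarse
    vision_obs = true_azimuth + rng.normal(0.0, 1.0)   # precise
    particles = particle_filter_step(particles, audio_obs, vision_obs, rng)
print("estimated azimuth:", particles.mean())
```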
|
John Hershey
University of California San Diego |
Audio-visual sound separation using hidden Markov models
|
Matthew Beal
Gatsby Computational Neuroscience Unit |
Bayesian combination of audio and video modalities
We present a self-calibrating algorithm for audio-visual
tracking using two microphones and a camera. The algorithm uses a
parametrized statistical model which combines simple models of video
and audio. Using unobserved variables, the model describes the
process that generates the observed data. Hence, it is able to capture
and exploit the statistical structure of the audio and video data, as
well as their mutual dependencies. The model parameters are estimated
by the EM algorithm; object templates are learned and automatic
calibration is performed as part of this procedure. Tracking is done
by Bayesian inference of the object location using the
model. Successful performance is demonstrated on real multimedia
clips.
|
Multi-Modal Perception Models
John Jeka
University of Maryland |
Properties of multisensory fusion for human spatial orientation
The advantage of multisensory information is often
expressed in terms of signal enhancement or resolution. Multisensory
signals are more easily detected than information from a single
sensory source. However, when placed into the context of perception
linked to the control of body movement, multisensory fusion has the
additional advantage of resolving ambiguities between movements of
different body components. Human self-orientation requires
multisensory fusion not for signal enhancement, but for a collective
characterization of multi-linked body dynamics. This characterization
entails two basic processes: estimation and control. Dynamic
characteristics of body sway identified with time series models have
led to measures that distinguish estimation from control. Using these
measures to then develop models that hypothesize underlying mechanisms
has resulted in two important findings. First, most of the variability
observed in body sway can be linked to the process of estimating the
center of mass from multisensory information. Standard control theory
algorithms cannot account for how multisensory information is fused
for center of mass estimation without a process we refer to as
|
Javier Movellan
University of California San Diego |
Information integration in humans and machines
We present ongoing work at UCSD's Machine Perception
Laboratory, trying to relate the psychophysics of human perception and
the development of perceptual machines that use multimodal
information. This research program proceeds by finding constraints
under which psychophysical regularities are optimal, analyzing
processing models compatible with such constraints, and testing
whether machine perception systems that adhere to such constraints
perform well. An illustration of the approach will be offered for the
problem of audio-visual speech perception and a psychophysical
regularity known as the Morton-Massaro law.
|
Thomas Anastasio
University of Illinois, Urbana-Champaign |
A probabilistic framework for understanding multisensory enhancement in the superior colliculus
The superior colliculus is organized topographically as
a neural map. The deep layers of the colliculus detect and localize
targets in the environment by integrating input from multiple sensory
systems. Some deep colliculus neurons receive input of only one
sensory modality (unimodal) while others receive input of multiple
modalities. Multimodal deep SC neurons exhibit multisensory
enhancement, in which the response to input of one modality is
augmented by input of another modality. Multisensory enhancement is
magnitude dependent in that combinations of smaller single-modality
responses produce larger amounts of enhancement, a property known as
inverse effectiveness. These findings are
consistent with the hypothesis that deep colliculus neurons use
sensory input to compute the probability that a target has appeared at
their corresponding location in the environment. Multisensory
enhancement and inverse effectiveness can be simulated using a model
in which sensory inputs are random variables and target probability is
computed using Bayes' Rule. Informational analysis of the model
indicates that input of another modality can indeed increase the
amount of target information received by a multimodal neuron, but only
if input of the initial modality is ambiguous. Unimodal deep
colliculus neurons may receive unambiguous input of one modality and
have no need of input of another modality.
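The Bayes' Rule computation described in this abstract can be sketched in a few lines. The Poisson input model, the prior of 0.1, the spontaneous and driven mean rates (5 and 8), and the function names are illustrative assumptions, not values taken from the abstract. The two inputs are also assumed to be conditionally independent given whether a target is present.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing count k from a Poisson source with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def target_posterior(inputs, prior=0.1, spont=5.0, driven=8.0):
    """P(target | inputs) by Bayes' Rule.

    inputs: list of integer input counts, one per sensory modality.
    Each input is Poisson with mean `driven` if a target is present
    and mean `spont` otherwise (assumed values).
    """
    like_target = math.prod(poisson_pmf(k, driven) for k in inputs)
    like_none = math.prod(poisson_pmf(k, spont) for k in inputs)
    numerator = prior * like_target
    return numerator / (numerator + (1.0 - prior) * like_none)

def enhancement_percent(k):
    """Percent gain of the bimodal posterior over the unimodal one."""
    unimodal = target_posterior([k])
    bimodal = target_posterior([k, k])
    return 100.0 * (bimodal - unimodal) / unimodal

for k in (10, 13, 16):
    print(k, round(target_posterior([k]), 3), round(enhancement_percent(k), 1))
```

With these assumed parameters the percent enhancement shrinks as the input counts grow: a strong unimodal input already pushes the posterior close to one, so a second modality has little left to add.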
|
Shihab Shamma
University of Maryland |
TBA
|
|