Modelling Nonlinear Shape-and-Texture Appearance Manifolds

 
Experimental Setup
 
 
Figure 1 -
Select frames of four subjects taken from the AVTIMIT database (left) and the multidimensional shape representation used by the nearest-neighbor deformable model (right). The multidimensional shape representation consists of shape feature points for the lip contour and the middle upper and lower teeth.
 
 

Below we report results using video sequences of five different speakers taken from the AVTIMIT database [3]. These five subjects are displayed in Figure 1. To train and test our models we used a total of 9 utterances for each subject. Of these utterances, 3 were used for training and 6 for test. From the 3 training utterances we randomly selected 100 frames and labelled them with multi-dimensional shape features for the lips and the middle upper and lower teeth (see Figure 1). For each subject, we also took 500 consecutive frames of video from the 6 test utterances to constitute a test set.

Using the 100-frame training set and the lip features of each subject we constructed a separate Active Appearance Model (AAM) [1] and a Gaussian mixture deformable model. The parameters used to train the AAM and the local AAMs of the Gaussian mixture deformable model are as reported in the publication. The Gaussian mixture deformable model was trained on each subject using 5 mixture components and a three-dimensional PCA space; we found these parameters to work well in our experiments. With the 100-frame training set and the multi-dimensional shape features we also constructed a nearest-neighbor deformable model for each subject. For each test image, we assume that an approximate estimate of the mouth's location in the image is provided by an external mouth detector; each of the above methods refines this estimate. When optimizing the shape-and-texture weights of the nearest-neighbor model we allowed each weight to take values between 0 and 1 for each subject.
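As a rough illustration of this training stage, the sketch below projects a set of stand-in training vectors into a three-dimensional PCA space and fits a 5-component Gaussian mixture to the coefficients with EM. The random data, the spherical-covariance choice, and all variable names are our own assumptions for illustration; the publication's actual features and mixture model will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))  # stand-in for 100 flattened training frames

# Three-dimensional PCA space, computed via SVD of the centred data
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
coeffs = (X - mean) @ Vt[:3].T  # (100, 3) PCA coefficients

# Five-component spherical Gaussian mixture fitted with EM
K = 5
n, d = coeffs.shape
mu = coeffs[rng.choice(n, K, replace=False)]  # init means from the data
var = np.full(K, coeffs.var())
pi = np.full(K, 1.0 / K)
for _ in range(25):
    # E-step: responsibility of each component for each point
    sq = ((coeffs[:, None, :] - mu[None]) ** 2).sum(-1)          # (n, K)
    logp = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - sq / (2 * var)
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update mixing weights, means, and spherical variances
    nk = r.sum(axis=0) + 1e-12
    pi = nk / n
    mu = (r.T @ coeffs) / nk[:, None]
    sq = ((coeffs[:, None, :] - mu[None]) ** 2).sum(-1)
    var = (r * sq).sum(axis=0) / (d * nk) + 1e-8

labels = r.argmax(axis=1)  # hard assignment of each frame to a local model
```

Each mixture component then plays the role of one local linear model over its region of the coefficient space.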

Below we present both qualitative and quantitative results for each method using the above dataset. In the first experiment we show the synthesized mouth image and shape for select frames of the 500-frame test set of each subject. These images highlight cases where the linear model fails to capture the appearance of the mouth and our methods succeed. We then give a quantitative comparison of each method using a box plot of the Root-Mean-Square fit error of each model over the 500 test images. Finally, we evaluate the performance of the nearest-neighbor model for different values of k, the number of nearest neighbors, and provide quantitative evidence that morphing between a set of local examples is advantageous over simply taking the nearest neighbor. In the qualitative results reported below, we use a modified version of the shape intersection algorithm described in the publication, which retains the shape features that appear in a majority of the nearest-neighbor examples when computing the final shape.
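Where the text refers to a Root-Mean-Square fit error, the sketch below shows one common reading of that metric: the RMS pixel difference between a model-synthesized image and the input test image. The function name and the stand-in images are our own; the publication may compute the error over a different support.

```python
import numpy as np

def rms_fit_error(synth, target):
    """RMS pixel difference between a synthesized mouth image and the
    input test image (our assumed form of the model fit error)."""
    diff = np.asarray(synth, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Stand-in images: a constant offset of 2 gives an RMS error of exactly 2.
a = np.zeros((32, 32))
b = np.full((32, 32), 2.0)
print(rms_fit_error(a, b))  # → 2.0
```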

 
Results
 
 
Figure 2 -
Qualitative comparison of each method across the five subjects. For each subject the synthesized image from each model is displayed along with the input test image and the synthesized multidimensional shape resulting from the nearest-neighbor deformable model. The RMS model fit error is displayed below each synthetic image. The top row of each subject demonstrates an example for which the AAM performs comparably to our methods; the bottom two rows of the first four subjects and the second row of the fifth subject illustrate cases where our models outperform the linear model. The bottom two rows of the fifth subject show cases where the nonlinear techniques perform poorly. The nonparametric model performs poorly when there is error in the nearest-neighbor computation. Unlike the simple linear and parametric deformable models, however, the nearest-neighbor model fails more gracefully in that it still converges to a mouth image.
 
 
 
Figure 3 -
Quantitative comparison of each method across subjects (left) and quantitative evaluation of the nearest-neighbor deformable model for different k values (right). The above box plots report the RMS error over the 2500 test frames compiled from all five subjects. In each plot, the horizontal lines of each box mark the top quartile, median, and bottom quartile; the whiskers give the extent of the rest of the data and the red crosses mark outliers. Of the three methods, the AAM performs worst on the mouth dataset and the non-parametric method performs best. For the nearest-neighbor deformable model, increasing k improves performance.
 
 

The above figures provide a summary of the qualitative and quantitative comparison of each method on the five subject mouth dataset. Figure 2 provides a qualitative comparison of these models. In the figure, the first row of each example illustrates a test frame where the AAM performs comparably to the nonlinear methods. The last two rows of the first four subjects and the second row of the fifth subject highlight scenarios where the nonlinear methods outperform the AAM. The bottom two rows of the fifth subject show cases where the nonlinear techniques perform poorly.

The nonparametric model performs poorly when there is error in the nearest-neighbor computation. This is illustrated in row three of the fifth subject, where the 10 nearest neighbors used to analyze this example are displayed. These neighbors, although consistent with one another, look quite different from the input image. These erroneous nearest neighbors are possibly due to our use of a naive nearest-neighbor distance metric (Euclidean distance in image space), which is sensitive to such factors as orientation and scale, but may also arise because the input is poorly represented in the training dataset. We are currently investigating the use of more intelligent distance metrics.
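The naive metric described above amounts to a brute-force Euclidean search in raw image space, which is what the sketch below implements. The helper name and stand-in data are illustrative only; they are not the publication's code.

```python
import numpy as np

def nearest_neighbors(query, train_images, k=10):
    """Return indices of the k training images closest to the query
    under Euclidean distance in raw image space -- the naive metric,
    sensitive to changes in orientation and scale."""
    flat = train_images.reshape(len(train_images), -1).astype(float)
    d = np.linalg.norm(flat - np.ravel(query).astype(float), axis=1)
    return np.argsort(d)[:k]

rng = np.random.default_rng(1)
train = rng.normal(size=(100, 16, 16))  # stand-in training mouth images
idx = nearest_neighbors(train[7], train, k=10)
print(idx[0])  # the closest example to train[7] is itself → 7
```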

The last row of the fifth subject demonstrates a scenario where the mixture model performs poorly. The Gaussian mixture model fails in much the same way an AAM does, in that it can extrapolate from the training set and produce non-mouth images. This is expected, since some components of the mixture model may span large portions of the manifold that vary nonlinearly. This error may be due to poor model selection, but it can also occur because the object appearance manifold is not well approximated as piecewise linear. The non-parametric model avoids the need for model selection and generalizes well to complex manifolds. Also, because it focuses on local regions of the manifold, the nearest-neighbor model fails more gracefully than the simple linear and parametric deformable models: it still converges to a mouth image, whereas both the linear and mixture deformable models can converge to regions off the manifold that yield non-mouth images.

Figure 3 gives a quantitative comparison of each method, showing a box plot of the RMS model fit error computed over the 2500-frame test sequence compiled from all five subjects. Of the three methods, the AAM performs worst on the mouth dataset and the non-parametric method performs best. Also displayed in the figure is a box plot evaluating the nearest-neighbor deformable model for different k values, again computed over the entire 2500-frame test set. The results indicate that morphing between a set of examples is advantageous over simply taking the nearest neighbor as the answer.
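A minimal sketch of the idea behind that last result: blending the k closest examples stays within the local neighborhood of the manifold, and at k=1 it reduces to plain nearest-neighbor lookup. The inverse-distance weights below are a hypothetical stand-in, not the paper's optimized shape-and-texture weights.

```python
import numpy as np

def knn_morph(query, examples, k=5):
    """Convex blend of the k examples closest to the query, weighted
    inversely by Euclidean distance; k=1 is plain nearest-neighbor.
    A sketch of the idea only, not the publication's morph."""
    d = np.linalg.norm(examples - query, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-8)   # hypothetical inverse-distance weights
    w /= w.sum()                # normalize to a convex combination
    return w @ examples[idx]

examples = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
query = np.array([1.0, 1.0])
print(knn_morph(query, examples, k=1))  # nearest example: [0. 0.]
```

Because the blend is a convex combination, the k>1 result always lies inside the hull of the retrieved examples, which is why the morph degrades more gracefully than a single possibly erroneous neighbor.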

 
Vision Interfaces Group - May 2005