Spoken Language Systems
MIT Computer Science and Artificial Intelligence Laboratory


Language is our primary means of communication, and speech is one of its most convenient and efficient modes of conveyance. In the Spoken Language Systems group we endeavor to create the technologies that enable advanced spoken language interaction between humans and machines. For someone in this line of research, these are exciting times. After many decades of laboratory research, speech technology has reached a tipping point in our society: talking with a computer has become an everyday occurrence via smartphones and other commercially available devices. People want to talk to their devices in all aspects of their lives, whether at home, at work, at play, or somewhere in between.

When people think about speech technology, they usually mean much more than just speech recognition, which is the process of identifying what words have been spoken. To do something useful, a machine typically needs to understand the underlying meaning, often in the larger context of a multi-turn interaction, and generate some kind of response to hold up its side of the conversation. Thus, a suite of technologies is necessary to enable these capabilities.

Speech is more than language, however. When we speak, the resulting waveform contains information about our identity, emotional state, health, etc., in addition to all the qualities associated with the linguistic message, such as which language, dialect, and speaking style we use. Technologies that are capable of extracting relevant information about these different facets of the speech signal will also play useful roles in our lives. Finally, speech recordings also contain information about the local environment, so, ultimately, speech is but one component of a larger audio tapestry that needs to be understood, perhaps jointly with other perceptual modalities such as vision.

The SLS group addresses a broad range of research topics, but they can generally be grouped according to three basic questions: 1) who is talking, 2) what is said, and 3) what is meant. The first area focuses on paralinguistic issues such as speaker verification, language and dialect identification, and speaker diarization (i.e., who spoke when). However, we are also beginning to examine health-related issues as they are manifested in the speech signal. The second area addresses core speech recognition capabilities, including challenges related to noise robustness, limited linguistic resources, and unsupervised language acquisition. The third and final area focuses more on the boundary between speech and natural language processing, and includes topics related to speech understanding as well as related areas such as sentiment analysis and dialogue. Some of this research focuses on open-ended user-generated text content such as social forums.

Research in speech and language processing is highly experimental, typically involving large quantities of either annotated or unannotated data. The mathematical models we create draw heavily from machine learning techniques such as graphical models and deep neural networks. While recent advances are remarkable, they represent only the tip of the iceberg; much more is needed to achieve the truly natural spoken language human-machine interaction depicted in science fiction movies and literature.

32 Vassar Street
Cambridge, MA 02139 USA
(+1) 617.253.3049

©2016, Spoken Language Systems Group. All rights reserved.
