Spoken Language Systems
MIT Computer Science and Artificial Intelligence Laboratory

Speech recognition makes some noise

Kimberly Patch & Eric Smalley

Copyright © InfoWorld 1998

Mention speech recognition these days, and it's almost inevitable that someone will point to HAL, the computer from 2001: A Space Odyssey. This illustration of where the technology is headed has lulled many IT managers into ignoring speech recognition because it's obvious that computers that can hold an intelligent conversation will remain science fiction for a long time. The trouble is, practical, usable speech-recognition products are here now.

Systems that recognize ordinary speech are sweeping through the call-center market and are poised to dramatically alter the very nature of desktop computing. IS managers are in danger of being caught flat-footed. Most at risk are those deploying or planning for new desktop computers, client/server networks, or transaction-processing systems.

Speech-recognition systems require large amounts of resources, including processing power, memory and network bandwidth. A failure to account for speech recognition could derail carefully laid plans for allocating those resources. Worse, it could disrupt strategic plans such as adopting thin clients.

Picture the road warriors in your company for a moment. Using laptops, they probably log in to the corporate network several times per day. But what if, instead, they used their cell phones, which seem glued to their ears anyway, to check their e-mail and their voice mail, as well as to query the order-processing system and pull documents from a file server? And that's just in-house use. Now imagine that every telephone in every home is a Web browser. How robust did you say your extranet is?

A little further out, speech is going to dramatically broaden what constitutes data, bringing about the long-promised multimedia revolution. Speech technology will accomplish this because it will enable practical content processing: the ability to easily search for and access audio and video material online.

"You're going to be able to annotate and index information, which is voice in nature," says Victor Zue, head of the Spoken Language Systems Group and associate director of the Laboratory for Computer Science at MIT, in Cambridge, Mass. "The vast amount of information that is in voice mail, that's in [recorded] meetings ... plus news broadcasts, plus entertainment. All those things are going to be indexable."

THE PARTS OF SPEECH. Speech-recognition technology, often incorrectly identified as voice recognition, has several components: noise-canceling input, a recognition engine, vocabularies, application interfaces, and rudimentary natural-language processing. (Voice recognition refers to voice-print security systems, commonly called voice ID.) There are two classes of speech-recognition technology: speaker-dependent, in which the user has to train the system to recognize his or her voice, and speaker-independent.

There are also two principal categories of speech recognition: keyboard and keypad. Keyboard applications allow users to speak directly to their computers, complementing or replacing the computer keyboard.

Keypad applications use speech to replace the telephone keypad as input for accessing voice mail and navigating a telephone system's menus. More important, they also allow the telephone to act as a remote computer peripheral.

"Your phone is your personal digital assistant," says Xuedong Huang, research manager for the Speech Technology Group at Microsoft Research, in Redmond, Wash. "You don't need to carry anything else. You can always be in touch with your computer."

Keypad applications tend to use limited vocabularies because they are focused on fairly narrow subjects. Limited vocabularies make it easier for these applications to be speaker-independent. The limited scope also allows for some elements of natural-language processing. Keyboard applications, particularly full dictation programs such as IBM's ViaVoice and Dragon Systems' NaturallySpeaking, tend to use larger vocabularies that, for now, require them to be speaker-dependent.

The 1997 breakthrough that jump-started the speech-recognition market was the release of products based on large vocabulary, continuous speech-recognition engines. Until then, large vocabulary systems were limited by discrete speech-recognition engines that required users to pause between words.

At the same time, natural-language technology is progressing rapidly. Natural language is the capability of a computer to decipher the meaning in ordinary, everyday speech, rather than requiring users to speak in prescribed patterns. A very limited application of the technology, which relies on the computer to decipher meaning from keywords, allows users of IBM's ViaVoice Gold to format Word documents.

"1998's going to be the year when the flood gate opens," says Ken Landoline, area director at Giga Information Group, a market research company in Cambridge, Mass.

Within the next several years, speech input will become commonplace, according to Jackie Fenn, vice president and research director of advanced technologies at the Gartner Group, in Stamford, Conn.

"By 2001 we'll see around 30 percent of users using speech recognition for some aspect of their daily work," Fenn says.

Financial-services companies appear to be at the forefront of adopting speech-recognition technology, both in call-center applications for customers and desktop applications for workers.

Chase Manhattan has literally removed the computer keyboards from the Global Trust Services office that processes bearer bonds, according to Nicholas Papanikolaw, senior vice president and chief operating officer of Global Trust Services. The company uses speech-recognition technology to boost efficiency, cutting the time it takes to process a single bond from 7 minutes to less than 1 minute, Papanikolaw says.

Because the application is a narrow one -- processing only one type of bond -- the bank has been able to develop a speaker-independent, natural-language application based on 200 keywords. This allows workers to say phrases such as "I want IBM" and "gimme GM." Chase Manhattan developed its application using technology from UmeVoice, in Novato, Calif.
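
To make the keyword idea concrete, here is a minimal sketch of keyword spotting over text that a recognizer has already transcribed. It is an illustration only, not a description of how the Chase Manhattan or UmeVoice software works: the TICKERS table, its three entries, and the spot_keywords function are all hypothetical names invented for this example.

```python
import re

# Hypothetical keyword table mapping a spoken keyword to a bond issuer.
# The article mentions about 200 keywords; these three are invented.
TICKERS = {"ibm": "IBM", "gm": "GM", "ge": "GE"}


def spot_keywords(utterance: str) -> list:
    """Return the issuer for every known keyword, in the order spoken.

    Words that are not in the table ("I", "want", "gimme") are ignored,
    which is what lets loosely phrased requests map to the same result.
    """
    words = re.findall(r"[a-z0-9]+", utterance.lower())
    return [TICKERS[w] for w in words if w in TICKERS]


if __name__ == "__main__":
    print(spot_keywords("I want IBM"))
    print(spot_keywords("gimme GM"))
```

Because everything outside the keyword table is discarded, the two very different phrasings quoted above reduce to a single lookup each, which is one plausible reason a narrow vocabulary makes speaker-independent operation easier.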

"If you look at any application that has a keyboard, I believe we can replace it with voice technology," Papanikolaw says. "You can certainly look at any data-entry application in the bank."

NOT IN MY BACK OFFICE. It's not a stretch to accept that speech recognition will quickly pervade the call-center market, particularly in the financial and travel-services industries. But for most IT managers it's another matter when considering desktop computer users. After all, why change when the keyboard and mouse have done the job for years, repetitive strain injuries aside? And who wants to add to the noise level in cubicle-filled work environments?

Efficiency gains such as those at Chase Manhattan are certainly incentive enough for IT managers who can identify specific applications that lend themselves to spoken input. After all, speech is the most natural means humans have for conveying information. However, social factors should not be discounted when measuring resistance to new technology.

Through the '80s, most PC users viewed the mouse-GUI combination as a tool for graphic artists and engineers working on Macintosh and high-end Unix systems. There did not seem to be a compelling reason to replace the familiar and relatively speedy command-line DOS interface with a new, awkward point-and-click interface.

But just as DOS applications hung around for years after Windows burst on the scene, no one is predicting that keyboards are going to disappear overnight when speech input takes hold. The bottom line is that the industry often accepts a technology because Microsoft incorporates it.

"The big question mark is obviously when Microsoft is going to start bundling [speech recognition] with the Office suites or the operating system," Fenn says. "That's going to have a big impact on the rate of adoption."

Like many companies, State Farm Insurance is keeping an eye on Microsoft, according to a company representative. State Farm, in Bloomington, Ill., is evaluating current speech-recognition products in-house, and is developing a speech-enabled camera application that will allow adjusters in the field to annotate photographs, he said.

Microsoft officials declined to comment on when and how the company would offer speech-recognition technology. Publicly, the company is focusing its efforts on promoting its Speech API (SAPI).

ACCOMMODATING SPEECH. With the technology on the market and Microsoft poised once again to alter the landscape, how do IT managers meld speech recognition into corporate networks? As far as the technology is concerned, there appears to be little reason to rush a decision.

"For probably the next year at least, [speech recognition] should be viewed as a tactical investment," Fenn says. "You probably don't want to commit to a corporate rollout until the products are hitting their second or third rounds."

But now is probably the right time to plan for the technology, especially for IT shops that are moving to network computers.

"If your strategy is to have all NCs in your next installation, you need to think about where you're going to put the voice processing," says Amy Wohl, president of Wohl Associates, in Narberth, Pa. "We think [vendors] are going to do server-side voice processing eventually, [but] there isn't very much of that yet."

IT managers "also need to think about bandwidth for their network because if they're going to use server-side voice processing, that's going to mean they're going to ship this stuff up and down the network," Wohl says.

IBM is working on a client/server version of ViaVoice, says Joe Orlando, worldwide marketing manager for IBM's ViaVoice. To handle the bandwidth crunch, IT staffs will need to use a tiered approach in which an intermediate layer of servers handles speech processing rather than back-end data servers, he said.
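
A rough calculation shows why bandwidth becomes a planning issue. The sketch below estimates the load of shipping uncompressed audio to a speech server. The audio format (16 kHz, 16-bit, mono) and the count of 50 simultaneous speakers are assumptions chosen for illustration; the article gives no such figures, and real products would likely compress the stream.

```python
def audio_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bit rate of one uncompressed PCM audio stream, in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def total_mbps(concurrent_speakers: int, per_stream_kbps: float) -> float:
    """Aggregate load, in megabits per second, when every speaker streams at once."""
    return concurrent_speakers * per_stream_kbps / 1000


if __name__ == "__main__":
    # Assumed format: 16 kHz sampling, 16 bits per sample, one channel.
    per_stream = audio_kbps(16_000, 16)
    # Assumed load: 50 users dictating at the same moment.
    print(f"one stream: {per_stream:.0f} kbit/s")
    print(f"50 streams: {total_mbps(50, per_stream):.1f} Mbit/s")
```

Under these assumptions a single stream is 256 kbit/s and 50 streams come to 12.8 Mbit/s, more than the nominal 10 Mbit/s of a 10BASE-T Ethernet segment, so where the processing sits matters.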

For handling speech processing on the desktop, the key factors are processing power and memory. Current large vocabulary speech-recognition products have minimum requirements of 166-MHz processors and 32MB of memory, although users are finding that 200-MHz processors and 64MB of memory are the threshold for adequate performance.

So the current installed base of 90-MHz, 120-MHz, and 133-MHz desktop systems is unable to support speech recognition, but this should be a short-term problem. Better compression will boost speech-recognition products' efficiency, and the installed base of desktop computers will eventually roll over to higher performance systems.
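
For an IT shop taking inventory, the thresholds above can be turned into a simple check. The sketch below uses the two tiers quoted in this article (166 MHz with 32MB as the vendors' minimum, 200 MHz with 64MB as the point of adequate performance). The function name and the sample machines, including their memory sizes, are hypothetical.

```python
# Thresholds quoted in the article: (processor MHz, memory MB).
MINIMUM = (166, 32)
ADEQUATE = (200, 64)


def speech_readiness(cpu_mhz: int, ram_mb: int) -> str:
    """Classify a desktop as 'adequate', 'minimum', or 'unsupported'."""
    if cpu_mhz >= ADEQUATE[0] and ram_mb >= ADEQUATE[1]:
        return "adequate"
    if cpu_mhz >= MINIMUM[0] and ram_mb >= MINIMUM[1]:
        return "minimum"
    return "unsupported"


if __name__ == "__main__":
    # Hypothetical inventory; the memory sizes are assumptions.
    fleet = [(90, 16), (133, 32), (166, 32), (200, 64)]
    for cpu, ram in fleet:
        print(f"{cpu} MHz / {ram}MB: {speech_readiness(cpu, ram)}")
```

A machine must clear both the processor and the memory figure to reach a tier, so a 200-MHz system with only 32MB still lands in the "minimum" group.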

Transaction processing is another area in which IT managers will have to account for speech recognition. The technology is improving the efficiency of call centers, which allows companies to expand business, thereby increasing transaction volume.

American Express is rolling out a speech-recognition system that will allow its corporate travel customers to get information and book their flights. The company expects the system, which is designed to augment human agents rather than replace them, to reduce the ratio of calls to transactions because callers will sort out their options before talking to an agent, according to David Pereira, senior manager for Corporate Services Interactive at American Express.

PLATFORM SPEECH. Perhaps the biggest impact speech recognition will have in the short term is in software development.

Initially, speech-recognition vendors integrated their products with individual applications. Microsoft's SAPI and Java Speech API from Sun now allow application developers to "speech-enable" their products.

"The third phase will be when applications are designed from the first day taking into consideration that speech is one of the modalities of interaction with them," says David Nahamoo, senior manager of the Human Language Technologies Department at IBM Research. "That will have tremendous impact on how applications are designed and developed."

This highlights the possibility of a user interface that bypasses, or at least minimizes, the importance of Windows. The race to develop such a speech-dominated interface is already on, raising the specter of a renewed battle between IBM and Microsoft for control of the desktop.

"We're not [saying] that we're going to go out and replace the Windows interface," says IBM's Orlando. "What we need to figure out is where to take a leadership position in creating a voice-user interface. That's a whole new ball game."

Kimberly Patch and Eric Smalley are both free-lance writers based in Boston.

Speech futures

Experts predict when and how speech recognition will take hold.

Jackie Fenn, Gartner Group
     30 percent of desktop users use speech recognition every day: 3 years
     User interface assumes voice input: 5 years
     HAL: 50+ years
Ezra Gottheil, Hurwitz Group
     Instant transcriptions of audio- and videoconferences: 3 years
Ken Landoline, Giga Information Group
     Continuous speech recognition: 2 years
     Commonplace in telephony: 3-5 years
     Speech-enabled appliances use speech recognition to sort through multiple databases a la Star Trek: 8-10 years
Amy Wohl, Wohl Associates
     Limited natural language processing in specific applications: 2 years
     A general natural-language model: 5 years
     HAL: decades
Victor Zue, Massachusetts Institute of Technology
     Limited applications of a conversational interface: 2-3 years
     HAL: decades
Koen Bouwers, Lernout & Hauspie
     In the operating system: 1-3 years
     Handheld PC dictation: 2-3 years
Roger Matus, Dragon Systems
     Microphones as common as mice: 3 years
     In the operating system: 5 years

Speech attracts a crowd

Speech recognition vendors fall into four categories: speech-to-text dictation, computer command-and-control, telephony, and electronic assistants. The players range from IBM, Microsoft, and Philips Electronics to a plethora of start-ups.

Dictation products for professionals, particularly doctors and lawyers, have been on the market for years. But the spotlight has been on the vendors that offer large vocabulary, general purpose dictation software. Four companies dominate this field: IBM, Dragon Systems, Philips and, through its acquisition of Kurzweil Applied Intelligence, Lernout & Hauspie (L&H).

All four offer their products to VARs and developers, and all but Philips sell shrink-wrapped versions of their products. The continuous speech version is not due until early this year.

The latest strategy is bundling. IBM is bundling ViaVoice Gold with Lotus SmartSuite and has a deal with AST Research. Dragon Systems has deals with Micron and Digital Equipment.

The mother of all bundling deals, of course, would be with Microsoft. In September, L&H announced an alliance with Microsoft, which has a substantial speech recognition effort of its own, but the companies declined to discuss plans.

Products that allow users to control Windows and other desktop OSes have long been on the market. Companies in the field include Advanced Recognition Technologies, Applied Voice Recognition, Command, and Verbex Voice Systems.

A large number of vendors focus on vertical markets, usually developing applications that incorporate recognition engines developed by one of the major companies. UmeVoice is an example of a vendor in the financial-services market.

Speech-recognition technology is also rapidly transforming the telephony market. Applied Language Technologies developed a reservation system for United Airlines. Nuance Communications developed a stock quote system for Charles Schwab & Co. and a travel information system for American Express. Other companies in the field include PureSpeech and Voice Control Systems.

And in the emerging field of electronic assistants, Wildfire Communications uses speech recognition in its Enterprise Wildfire call-management system.

