ENTERPRISE COMPUTING
Speech recognition makes some noise
Kimberly Patch & Eric Smalley
02/02/98
InfoWorld
Copyright © InfoWorld 1998
Mention speech recognition these days, and it's almost inevitable that someone will point
to HAL, the computer from 2001: A Space Odyssey. This illustration of where the technology is
headed has lulled many IT managers into ignoring speech recognition because it's obvious
that computers that can hold an intelligent conversation will remain science fiction for a
long time. The trouble is, practical, usable speech-recognition products are here now.
Systems that recognize ordinary speech are sweeping through the call-center market and are poised
to dramatically alter the very nature of desktop computing. IS managers are in danger of being
caught flat-footed. Most at risk are those deploying or planning for new desktop computers,
client/server networks, or transaction-processing systems.
Speech-recognition systems require large amounts of resources, including processing power, memory,
and network bandwidth. A failure to account for speech recognition could derail carefully laid
plans for allocating those resources. Worse, it could disrupt strategic plans such as adopting
thin clients.
Picture the road warriors in your company for a moment. Using laptops, they probably log in to
the corporate network several times per day. But what if, instead, they used their cell phones,
which seem glued to their ears anyway, to check their e-mail and their voice mail, as well as to
query the order-processing system and pull documents from a file server? And that's just in-house
use. Now imagine that every telephone in every home is a Web browser. How robust did you say your
extranet is?
A little further out, speech is going to dramatically broaden what constitutes data, bringing about
the long-promised multimedia revolution. Speech technology will accomplish this because it will
enable practical content processing: the ability to easily search for and access audio and video
material online.
"You're going to be able to annotate and index information, which is voice in nature," says Victor
Zue, head of the Spoken Language Systems Group and associate director of the Laboratory for Computer
Science at MIT, in Cambridge, Mass. "The vast amount of information that is in voice mail, that's
in [recorded] meetings ... plus news broadcasts, plus entertainment. All those things are going to
be indexable."
THE PARTS OF SPEECH. Speech-recognition technology, often incorrectly identified as voice recognition,
has several components: noise-canceling input, a recognition engine, vocabularies, application
interfaces, and rudimentary natural-language processing. (Voice recognition refers to voice-print
security systems, commonly called voice ID.) There are two classes of speech-recognition technology:
speaker-dependent, in which the user has to train the system to recognize his or her voice, and
speaker-independent.
There are also two principal categories of speech-recognition applications: keyboard and keypad. Keyboard
applications allow users to speak directly to their computers, complementing or replacing the computer
keyboard.
Keypad applications use speech to replace the telephone keypad as input for accessing voice mail and
navigating a telephone system's menus. More important, they also allow the telephone to act as a
remote computer peripheral.
"Your phone is your personal digital assistant," says Xuedong Huang, research manager for the
Speech Technology Group at Microsoft Research, in Redmond, Wash. "You don't need to carry anything
else. You can always be in touch with your computer."
Keypad applications tend to use limited vocabularies because they are focused on fairly narrow
subjects. Limited vocabularies make it easier for these applications to be speaker-independent. The
limited scope also allows for some elements of natural-language processing. Keyboard applications,
particularly full dictation programs such as IBM's ViaVoice and Dragon Systems' NaturallySpeaking,
tend to use larger vocabularies that, for now, require them to be speaker-dependent.
The 1997 breakthrough that jump-started the speech-recognition market was the release of products
based on large-vocabulary, continuous speech-recognition engines. Until then, large-vocabulary systems
were limited by discrete speech-recognition engines that required users to pause between words.
At the same time, natural-language technology is progressing rapidly. Natural language is the
capability of a computer to decipher the meaning in ordinary, everyday speech, rather than
requiring users to speak in prescribed patterns. A very limited application of the technology,
which relies on the computer to decipher meaning from keywords, allows users of IBM's ViaVoice
Gold to format Word documents.
"1998's going to be the year when the flood gate opens," says Ken Landoline, area director at
Giga Information Group, a market research company in Cambridge, Mass.
Within the next several years, speech input will become commonplace, according to Jackie Fenn,
vice president and research director of advanced technologies at the Gartner Group, in Stamford,
Conn.
"By 2001 we'll see around 30 percent of users using speech recognition for some aspect of their
daily work," Fenn says.
Financial-services companies appear to be at the forefront of adopting speech-recognition technology,
both in call-center applications for customers and desktop applications for workers.
Chase Manhattan has literally removed the computer keyboards from the Global Trust Services
office that processes bearer bonds, according to Nicholas Papanikolaw, senior vice president
and chief operating officer of Global Trust Services. The company uses speech-recognition
technology to boost efficiency, reducing the time it takes to process a single bond from 7 minutes
to less than 1 minute, Papanikolaw says.
Because the application is a narrow one -- only processing one type of bond -- the bank has been
able to develop a speaker-independent, natural-language application based on 200 keywords. This
allows workers to say phrases such as "I want IBM" and "gimme GM." Chase Manhattan developed its
application using technology from UmeVoice, in Novato, Calif.
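The keyword approach described above can be sketched in a few lines of Python. This is only an illustration, not Chase Manhattan's or UmeVoice's implementation: it assumes the recognition engine has already turned the audio into a text transcript, and the keyword tables and the function name `spot_keywords` are hypothetical.

```python
import re

# Hypothetical keyword tables. A system like the one described would hold
# roughly 200 keywords; only a handful are shown here.
ACTIONS = {"want": "select", "gimme": "select", "get": "select"}
SECURITIES = {"ibm": "IBM", "gm": "GM"}


def spot_keywords(utterance):
    """Return (action, security) found in a transcript, ignoring filler words."""
    words = re.findall(r"[a-z]+", utterance.lower())
    # Take the first word that matches each table; anything else is filler.
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    security = next((SECURITIES[w] for w in words if w in SECURITIES), None)
    return action, security


print(spot_keywords("I want IBM"))
print(spot_keywords("gimme GM"))
```

Because only the keywords matter, "I want IBM" and "gimme GM" resolve to the same kind of request, which is what lets a narrow application accept casual phrasing from any speaker.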
"If you look at any application that has a keyboard, I believe we can replace it with voice
technology," Papanikolaw says. "You can certainly look at any data-entry application in the
bank."
NOT IN MY BACK OFFICE. It's not a stretch to accept that speech recognition will quickly pervade
the call-center market, particularly in the financial and travel-services industries. But for most
IT managers it's another matter when considering desktop computer users. After all, why change when
the keyboard and mouse have done the job for years, repetitive strain injuries aside? And who wants
to add to the noise level in cubicle-filled work environments?
Efficiency gains such as those at Chase Manhattan are certainly incentive enough for IT managers
who can identify specific applications that lend themselves to spoken input. After all, speech is
the most natural means humans have for conveying information. However, social factors should not
be discounted when measuring resistance to new technology.
Through the '80s, most PC users viewed the mouse-GUI combination as a tool for graphic artists
and engineers working on Macintosh and high-end Unix systems. There did not seem to be a compelling
reason to replace the familiar and relatively speedy command-line DOS interface with a new, awkward
point-and-click interface.
But just as DOS applications hung around for years after Windows burst on the scene, no one is
predicting that keyboards are going to disappear overnight when speech input takes hold. The
bottom line is that the industry often accepts a technology because Microsoft incorporates it.
"The big question mark is obviously when Microsoft is going to start bundling [speech recognition]
with the Office suites or the operating system," Fenn says. "That's going to have a big impact on
the rate of adoption."
Like many companies, State Farm Insurance is keeping an eye on Microsoft, according to a company
representative. State Farm, in Bloomington, Ill., is evaluating current speech-recognition products
in-house, and is developing a speech-enabled camera application that will allow adjusters in the
field to annotate photographs, he said.
Microsoft officials declined to comment on when and how the company would offer speech-recognition
technology. Publicly, the company is focusing its efforts on promoting its Speech API (SAPI).
ACCOMMODATING SPEECH. With the technology on the market and Microsoft poised once again to alter
the landscape, how do IT managers meld speech recognition into corporate networks? As far as the
technology is concerned, there appears to be little reason to rush a decision.
"For probably the next year at least, [speech recognition] should be viewed as a tactical investment,"
Fenn says. "You probably don't want to commit to a corporate rollout until the products are hitting
their second or third rounds."
But now is probably the right time to plan for the technology, especially for IT shops that are
moving to network computers.
"If your strategy is to have all NCs in your next installation, you need to think about where you're
going to put the voice processing," says Amy Wohl, president of Wohl Associates, in Narberth, Pa.
"We think [vendors] are going to do server-side voice processing eventually, [but] there isn't very
much of that yet."
IT managers "also need to think about bandwidth for their network because if they're going to use
server-side voice processing, that's going to mean they're going to ship this stuff up and down the
network," Wohl says.
IBM is working on a client/server version of ViaVoice, says Joe Orlando, worldwide marketing manager
for IBM's ViaVoice. To handle the bandwidth crunch, IT staffs will need to use a tiered approach in
which an intermediate layer of servers handles speech processing rather than back-end data servers,
he said.
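A rough calculation shows why bandwidth matters for server-side processing. The Python sketch below is a back-of-the-envelope estimate, not a vendor figure: the sample rate, sample size, number of simultaneous speakers, and the 10-Mbps network capacity are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def stream_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Bits per second for one uncompressed audio stream."""
    return sample_rate_hz * bits_per_sample * channels


def aggregate_bps(per_stream_bps, simultaneous_speakers):
    """Total bits per second if every speaker streams at the same time."""
    return per_stream_bps * simultaneous_speakers


# Assumed values, for illustration only.
SAMPLE_RATE_HZ = 16_000      # assumed microphone sample rate
BITS_PER_SAMPLE = 16         # assumed sample size
SPEAKERS = 100               # assumed number of users talking at once
NETWORK_BPS = 10_000_000     # assumed shared 10-Mbps Ethernet segment

one = stream_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
total = aggregate_bps(one, SPEAKERS)

print(f"One stream: {one / 1000:.0f} Kbps")
print(f"{SPEAKERS} streams: {total / 1_000_000:.1f} Mbps")
print("Exceeds network capacity" if total > NETWORK_BPS else "Fits on the network")
```

Under these assumptions a single uncompressed stream is modest, but a department's worth of simultaneous speakers overwhelms a shared segment, which is the case for compression and for an intermediate tier of speech servers.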
For handling speech processing on the desktop, the key factors are processing power and memory.
Current large-vocabulary speech-recognition products have minimum requirements of 166-MHz processors
and 32MB of memory, although users are finding that 200-MHz processors and 64MB of memory are the
threshold for adequate performance.
So, the current installed base of 90-MHz, 120-MHz, and 133-MHz desktop systems is unable to support
speech recognition, but this should be a short-term problem. Better compression will boost
speech-recognition products' efficiency and the installed base of desktop computers will eventually
roll over to higher performance systems.
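An IT shop could screen its desktop inventory against those thresholds with a short script. The Python sketch below uses the processor and memory figures cited above; the inventory entries, machine names, and the function name `classify` are hypothetical.

```python
# Thresholds from the figures cited above, as (MHz, MB of memory).
MINIMUM = (166, 32)    # vendors' stated minimum requirements
ADEQUATE = (200, 64)   # level users report as adequate


def classify(mhz, memory_mb):
    """Rate a desktop for large-vocabulary speech recognition."""
    if mhz >= ADEQUATE[0] and memory_mb >= ADEQUATE[1]:
        return "adequate"
    if mhz >= MINIMUM[0] and memory_mb >= MINIMUM[1]:
        return "minimum"
    return "unsupported"


# Hypothetical inventory: (machine name, MHz, MB of memory).
inventory = [
    ("desk-01", 90, 16),
    ("desk-02", 133, 32),
    ("desk-03", 166, 32),
    ("desk-04", 200, 64),
]

for name, mhz, memory_mb in inventory:
    print(name, classify(mhz, memory_mb))
```

A machine must meet both the processor and the memory figure to reach a tier, so a fast system that is short on memory still falls to the lower rating.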
Transaction processing is another area in which IT managers will have to account for speech
recognition. The technology is improving the efficiency of call centers, which allows companies to
expand business, thereby increasing transaction volume.
American Express is rolling out a speech-recognition system that will allow its corporate travel
customers to get information and book their flights. The company expects the system, which is
designed to augment human agents instead of replacing them, to reduce the ratio of calls to
transactions because callers will sort out their options before talking to an agent, according
to David Pereira, senior manager for Corporate Services Interactive at American Express.
PLATFORM SPEECH. Perhaps the biggest impact speech recognition will have in the short term is in
software development.
Initially, speech-recognition vendors integrated their products with individual applications.
Microsoft's SAPI and Java Speech API from Sun now allow application developers to "speech-enable"
their products.
"The third phase will be when applications are designed from the first day taking into consideration
that speech is one of the modalities of interaction with them," says David Nahamoo, senior manager
of the Human Language Technologies Department at IBM Research. "That will have tremendous impact on
how applications are designed and developed."
This highlights the possibility of a user interface that bypasses, or at least minimizes, the
importance of Windows. The race to develop such a speech-dominated interface is already on, raising
the specter of a renewed battle between IBM and Microsoft for control of the desktop.
"We're not [saying] that we're going to go out and replace the Windows interface," says IBM's Orlando.
"What we need to figure out is where to take a leadership position in creating a voice-user interface.
That's a whole new ball game."
Kimberly Patch and Eric Smalley are both free-lance writers based in Boston.
Speech futures
Experts predict when and how speech recognition will take hold.
Jackie Fenn, Gartner Group
    30 percent of desktop users use speech recognition every day: 3 years
    User interface assumes voice input: 5 years
    HAL: 50+ years

Ezra Gottheil, Hurwitz Group
    Instant transcriptions of audio- and videoconferences: 3 years

Ken Landoline, Giga Information Group
    Continuous speech recognition: 2 years
    Commonplace in telephony: 3-5 years
    Speech-enabled appliances use speech recognition to sort through multiple databases a la Star Trek: 8-10 years

Amy Wohl, Wohl Associates
    Limited natural language processing in specific applications: 2 years
    A general natural-language model: 5 years
    HAL: decades

Victor Zue, Massachusetts Institute of Technology
    Limited applications of a conversational interface: 2-3 years
    HAL: decades

Koen Bouwers, Lernout & Hauspie
    In the operating system: 1-3 years
    Handheld PC dictation: 2-3 years

Roger Matus, Dragon Systems
    Microphones as common as mice: 3 years
    In the operating system: 5 years

Speech attracts a crowd
Speech recognition vendors fall into four categories: speech-to-text dictation, computer
command-and-control, telephony, and electronic assistants. The players range from IBM, Microsoft,
and Philips Electronics to a plethora of start-ups.
Dictation products for professionals, particularly doctors and lawyers, have been on the market for
years. But the spotlight has been on the vendors that offer large-vocabulary, general-purpose dictation
software. Four companies dominate this field: IBM, Dragon Systems, Philips and, through its
acquisition of Kurzweil Applied Intelligence, Lernout & Hauspie (L&H).
All four offer their products to VARs and developers, and all but Philips sell shrink-wrapped
versions of their products. The continuous speech version is not due until early this year.
The latest strategy is bundling. IBM is bundling ViaVoice Gold with Lotus SmartSuite and has a
deal with AST Research. Dragon Systems has deals with Micron and Digital Equipment.
The mother of all bundling deals, of course, would be with Microsoft. In September, L&H announced
an alliance with Microsoft, which has a substantial speech recognition effort of its own, but the
companies declined to discuss plans.
Products that allow users to control Windows and other desktop OSes have long been on the
market. Companies in the field include Advanced Recognition Technologies, Applied Voice
Recognition, Command, and Verbex Voice Systems.
A large number of vendors focus on vertical markets, usually developing applications that
incorporate recognition engines developed by one of the major companies. UmeVoice is an example
of a vendor in the financial-services market.
Speech-recognition technology is also rapidly transforming the telephony market. Applied Language
Technologies developed a reservation system for United Airlines. Nuance Communications developed a
stock quote system for Charles Schwab & Co. and a travel information system for American Express.
Other companies in the field include PureSpeech and Voice Control Systems.
And in the emerging field of electronic assistants, Wildfire Communications uses speech recognition
in its Enterprise Wildfire call-management system.