SpeechBuilder

Introduction

The SpeechBuilder utility is intended to allow people unfamiliar with speech and language processing to create their own speech-based applications. The focus of SpeechBuilder version 1.0 is to allow developers to specify the knowledge representation and linguistic constraints necessary to automate the design of speech recognition and natural language understanding. To do this, SpeechBuilder uses a simple web-based interface which allows a developer to describe the important semantic concepts (e.g., objects, attributes) for their application, and to show, via example sentences, what kinds of actions can be performed. Once the developer has provided this information, along with the URL of their CGI-based application, they can use SpeechBuilder to automatically create their own spoken dialogue system which they, and others, can talk to in order to access information.

Background

SpeechBuilder makes use of human language technology (HLT) (e.g., speech recognition, language understanding, system architecture, etc.) developed by scientists in the Spoken Language Systems Group at the MIT Computer Science and Artificial Intelligence Laboratory. Researchers there are trying to develop next-generation human language technologies which will allow users to converse naturally with computers, anywhere, anytime. In contrast to many current speech-based applications which constrain what a user can say during a dialogue, their goal is to provide much more freedom to the user in the way they talk with computers. In order to demonstrate and improve this technology, they have created several conversational systems which have been publicly deployed on toll-free telephone numbers in North America, including the widely used Jupiter system for weather forecast information, the Pegasus system for flight status information, and the more recent Mercury system for flight information and pricing. If you have not used these systems before, please try them to see how this technology works! (i.e., donate your voice to science!) If you are in the Boston area, you can visit the MIT Museum and try talking to our systems which have a display for output.

Although these applications have been successful, there are limited resources at MIT to develop a large number of new domains. In order to encourage and enable others to build their own domains, the SpeechBuilder utility was created to make it easier for HLT novices to create their own application(s), or for researchers learning about speech and language to create a prototype application which they can subsequently modify manually. If successful, this utility will benefit others by allowing them to tailor an application to their particular interests. In addition, it will facilitate the collection of a wide variety of conversational speech data which can be used to further improve the basic human language technologies used by these applications. SpeechBuilder developers will also test the ability of HLT technology to be rapidly ported to a variety of application domains with different vocabularies, grammars, knowledge representations, and discourse and dialogue structures.

Architecture

A SpeechBuilder application has two basic parts: first, the human language technologies which perform speech recognition, language understanding etc., and second, the application program which takes a semantic representation produced by the language understanding component and determines what information to return to the user. The HLTs are automatically configured by SpeechBuilder using information provided by the developer, and run on compute servers residing at MIT. The application consists of a program (e.g., Perl script) created by the developer using the Common Gateway Interface (CGI) protocol, running on a CGI-capable web server anywhere on the Internet. The semantic representation produced by the HLTs takes the form of conventional CGI parameters which get passed to the application program via standard HTTP protocols.

There are four CGI parameters which are currently used by SpeechBuilder: text, action, frame, and history. As may be surmised, the text parameter contains the words which were understood from the user, while the action parameter specifies the kind of action being requested by the user. The frame parameter lists the semantic concepts which were found in the utterance. In their simplest form, semantic concepts are essentially key/value pairs (e.g., color=blue, city=Boston, etc). More complex semantic concepts have hierarchical structure such as:
time=(hour=11,minute=30,xm=AM), or
item=(object=box,beside=(object=table))

The following examples illustrate possible action and frame values for different user queries:

turn on the lights in the kitchen
action=set&frame=(object=lights, room=kitchen, value=on)
will it be raining in Boston on Friday
action=verify&frame=(city=Boston,day=Friday,property=rain)
are there any chinese restaurants on Main Street
action=identify&frame=(object=(type=restaurant, cuisine=chinese, on=(street=Main,ext=Street)))
I want to fly from Boston to San Francisco arriving before ten a m
action=list&frame=(src=BOS,dest=SFO, arrival_time=(relative=before,time=(hour=10,xm=AM)))
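Since these action and frame values arrive as ordinary CGI parameters, a standard query-string parser can split them apart before any frame-specific processing. A brief Python sketch (the SpeechBuilder starter kit itself is Perl-based; a real request may also percent-encode the parentheses and equals signs, which the same parser handles):

```python
from urllib.parse import parse_qs

# Split the raw query string into the SpeechBuilder CGI parameters.
# The frame value is kept intact for later, frame-specific parsing.
query = "action=set&frame=(object=lights,room=kitchen,value=on)"
params = {k: v[0] for k, v in parse_qs(query).items()}

print(params["action"])  # set
print(params["frame"])   # (object=lights,room=kitchen,value=on)
```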

Since a CGI program does not retain any state information (e.g., dialogue), the history parameter enables an application to provide information back to the HLT servers that can be used to help interpret subsequent queries. The history parameter usually contains the contents of the resolved frame parameter of the previous utterance. For example, in the following exchange the history parameter is used to keep track of local discourse context:

what is the phone number of John Smith
action=identify&frame=(property=phone,name=John+Smith)
what about his email address
action=identify&frame=(property=email)
&history=(property=phone,name=John+Smith)
what about Jane Doe
action=identify&frame=(name=Jane+Doe)
&history=(property=email,name=John+Smith)

The remainder of this document provides information which is intended to help developers use SpeechBuilder to create their own speech application. The section on knowledge representation provides information about the format used by SpeechBuilder to specify concepts and provide linguistic constraints. It also describes how a developer can produce hierarchical frames through the use of bracketing in the example action sentences. The section on using the web interface describes the mechanics of how a developer actually uses the SpeechBuilder utility. Finally, the section on creating a CGI application describes some of the issues involved in parsing the frame and history parameters, and provides details on how a developer can download a startup kit (written in Perl) which provides a useful module for parsing these parameters, as well as a sample application.

Knowledge Representation

Keys and actions

Semantic concepts and linguistic constraints are currently specified in SpeechBuilder via keys and actions. Keys usually define classes of semantically equivalent words or word sequences, so that all the entries of a key class should play the same role in an utterance. All concepts which are expected to reside in a frame must be a member of a key class. The following table contains example keys.

Key Examples
color red, green, blue
day Monday, Tuesday, Wednesday
room living room, dining room, kitchen
appliance television, radio, VCR

Actions define classes of functionally equivalent sentences, so that all the entries of an action class perform the same operation in the application. All example sentences will generate the appropriate action CGI parameter if they are spoken by the user. SpeechBuilder will generalize all example sentences containing particular key entries to all the entries of the same key class. SpeechBuilder also tries to generalize the non-key words in the example sentences so that it can understand a wider range of user queries than were provided by the developer. However, if the user does say something that cannot be understood, the action CGI parameter will have a value of action=unknown, while the frame parameter will contain all the keys which were decoded from the speech signal. The following table contains example actions.

Action    Examples
identify  what is the forecast for Boston
          what will the temperature be on Tuesday
          I would like to know today's weather in Denver
set       turn the radio on in the kitchen please
          can you please turn off the dining room lights
          turn on the tv in the living room
good_bye  good bye
          thank you very much good bye
          see you later

Note that capitalization is unnecessary in the example sentences. In fact, since SpeechBuilder is case-sensitive, words should be represented consistently in the example sentences. No form of punctuation should be used.

JSGF Formatting

In order to allow a developer to efficiently convey minor variations in structurally similar keys or actions, SpeechBuilder parses a subset of the Java Speech Grammar Format (JSGF). Developers can use these markup characters to specify optional words or word sequences, or can specify several alternate words or phrases as part of an input by separating them with the vertical bar, |. The parentheses, (), (e.g., (one | two three)), are used to indicate that exactly one of the alternatives is to be used. The square brackets, [], (e.g., [please], [one | two]) are used to indicate an optional item or items.

For example, a developer could enter the sentence, [please] (put | place) the red box on the table. SpeechBuilder would automatically treat this as if the developer had entered all four variations of the sentence:
put the red box on the table
please put the red box on the table
place the red box on the table
please place the red box on the table

Regularizing semantically equivalent concepts

In addition to the standard JSGF markup characters, the curly braces, {}, can be used to regularize the output form of a set of alternative entries which are semantically equivalent. The default output form for any key entry is just the value of the entry itself. For example, if a city class contains entries such as Boston and Philadelphia, the corresponding frame representations would be city=Boston and city=Philadelphia, respectively.

However, in cases where there are alternative ways of saying the same semantic concept (e.g., Philly), the curly braces can be used to produce a consistent output form for all of the alternatives. In this example, this could be accomplished by having an additional entry of Philly {Philadelphia} in the city class, or by modifying the Philadelphia entry to be (Philadelphia | Philly) {Philadelphia}. Either method would ensure that the use of the word Philly would produce a frame output of city=Philadelphia.

The ability to control the output form of a key entry gives the developer added flexibility, and can make the application program easier to create. First, the application does not need to know about all of the possible variations for a given concept. Second, it allows the developer the ability to perform simple translations. For instance, in recognizing the phone number, the developer might add translations like one {1} to ensure that the application receives an actual numerical phone number. Similarly, a key containing cities might map names to appropriate codes (e.g., Boston [Massachusetts] {BOS}), to reduce processing in the application.

Since the action CGI parameter is determined entirely by which action entry matched the utterance, there is no need for the {} markup characters in the action entries.

Handling ambiguities

In order to deal with ambiguous keys, it is possible to force SpeechBuilder to treat a set of words in an action as having originated from a certain key. For instance, if Metallica is both an artist and an album, and an example reads, tell me what songs were sung by Metallica, then SpeechBuilder will, by default, choose either the album key or the artist key arbitrarily (and possibly incorrectly). In cases like this, the developer can specify which key class to use by enclosing the ambiguous words with <> diacritics, and placing the name of the key class in front, followed by =. In the above example, tell me what songs were sung by artist=<Metallica> would force SpeechBuilder to treat Metallica as an artist when trying to generalize that particular sentence. Note that a simpler way to handle ambiguity is to use unambiguous terms wherever possible in the example sentences.

Occasionally there will be words (or possibly even phrases) with multiple realizations in a domain, only a subset of which are considered to be semantically important by the developer. A typical scenario involves words which have alternative syntactic uses. For example, the word may can be used as a noun (I would like a flight on may third), or as a verb (may I get a flight from Boston to Dallas). If the developer simply enters the word may in a month class, the output form month=may will be produced for both example sentences. This outcome can be avoided by using different representations for different variations of a word (e.g., may vs. May). Since SpeechBuilder is case-sensitive, it treats these variations as completely different entries (although their pronunciations are the same). Note that the developer must take care to use a consistent format throughout the example sentences.

Specifying hierarchical concepts

SpeechBuilder allows the developer to build a structured grammar when this is desired. To do this, the developer needs to bracket parts of some of the example sentences in the action classes, in order that the system may learn where the structure lies. If the developer chooses to bracket a sentence, they must bracket the entire sentence, since SpeechBuilder will treat it as the complete description of that example. Also, SpeechBuilder only generalizes hierarchy within a particular action class. This means that if two separate actions exhibit similar hierarchical structures, at least one sentence in each class must be fully bracketed by the developer.

To bracket a sentence, the developer encloses the substructure which they wish to separate in parentheses, preceded by a name for the substructure followed by either == or =, depending on whether the developer desires to use strict or flattened hierarchy.

Bracketing results in SpeechBuilder creating hierarchy in the frame parameter. Hierarchy can also be more than one level deep, and can mix both types of hierarchy. Note that bracketing a sentence only involves pointing out the hierarchy -- the keys are still automatically discovered by SpeechBuilder.

Strict Hierarchy

The developer can specify strict hierarchy by using the == in bracketing a subsection of text. When strict hierarchy is used, all of the keys under the bracketed region are treated as they normally would, and each key becomes a key=value pair within the subframe, as described previously. This provides a consistent means to bracket subsections, yet have each subsection retain the same keys and values as the developer would expect in a flat grammar. It also provides increased flexibility, since multiple levels of recursion will generate a consistent structure which is easy for the application to deal with.

For instance, if the following sentence was bracketed as Please put source==(the blue box) destination==(on the table), its frame would look like source=(color=blue, object=box), destination=(object=table). With an extra level of hierarchy, Please put source==(the blue box) destination==(on the table in the location==(kitchen)) would become source=(color=blue, object=box), destination=(relative=on, object=table, location=(room=kitchen)).

Flattened Hierarchy

It is not always desirable to receive all of the key/value pairs within a hierarchical section of the grammar. For instance, a query in a flight domain might read, Book me a flight from Boston to Dallas. In this case, an example using strict hierarchy such as Book me a flight from source==(Boston) to destination==(Dallas), would result in a frame like, source=(city=Boston), destination=(city=Dallas). However, if the developer knows that sources and destinations are always cities, it might be simpler just to receive source and destination as keys, without the nested city labels. Flattened hierarchy allows the developer to do exactly this.

When the developer specifies flattened hierarchy by using an = in bracketing a subsection of text, instead of the value of the subsection name being a set of key/value pairs enclosed in parentheses, the value will be composed of all of the keys inside the parentheses, separated by spaces. For the previous example, we could bracket the sentence as Book me a flight from source=(Boston) to destination=(Dallas), and the resulting frame would be source=Boston, destination=Dallas (i.e., without any city= key/value pairs or parenthesized subsections). Thus, if the developer knows the type of key that will appear within a hierarchical subsection, he or she won't have to deal with parsing the key's name.

If more than one key of the same class appears within a flattened subsection, SpeechBuilder will separate those values by underbars. This allows developers to easily build parameters which are made up of more than one entry from a class, like a phone number, without having to dig through the key name for each entry. For instance, if there was a digit class holding the digits zero through nine, and the developer gave the example, Who has the phone number number=(two five three seven seven one nine) SpeechBuilder would return a frame containing number=2537719 (if the digits were keys, and were reduced with {}'s to their numeric form.)

Although it might be possible to automatically determine which regions should be flattened by the number of concepts inside, we chose to let the developer specify the type of hierarchy they wanted so that they could be assured of exactly how the system would act.

Using the web interface

Registering as a developer

To use the SpeechBuilder web interface, you must register as a developer to get an account. Whenever you visit the SpeechBuilder site, you will need to enter your account name and password to gain access to your applications. If you click on the cancel button in the pop-up login window the first time you visit the site, you will automatically be taken to the registration page to provide basic information such as your name and email address, and select a developer id (your user name) and a password. You can subsequently use this user name and password to login to SpeechBuilder and create your speech applications. You will be emailed a code number which you will need when you call the developer telephone number to talk to your application.

Building a speech application

When you login to SpeechBuilder, you will be able to modify or delete any of the applications which you have previously created, or create new ones. To create a speech application, a developer needs to provide to SpeechBuilder 1) a comprehensive set of semantic concepts and example queries for their particular domain (specified in terms of keys and actions), and 2) the URL of a CGI script which will take the CGI parameters produced for a user query and provide the appropriate information. Once this has been done, the developer 1) presses a button for SpeechBuilder to compile the information for their application into the form needed by the human language components, 2) presses another button to start the human language components for their application running (on our web server), and 3) calls the SpeechBuilder developer phone number and starts talking to their system. When the developer calls the developer phone number, they will first be asked to enter their developer code number (which is emailed to them when they register). The SpeechBuilder system will then automatically switch to their particular running application.

Selecting a Domain

When a developer first logs in to SpeechBuilder, they are presented with a domain selection menu which allows them to select and edit one of the existing domains in their directory. The menu also contains a set of buttons allowing them to Add, Remove, Copy, and Rename domains.

Editing a Domain

Once a domain has been selected for editing, the developer is presented with a domain editing form. The upper right portion of this form contains a summary of the semantic classes which have been defined for the domain. There are three types of semantic classes which are listed. The first two lists contain all of the names of the actions and keys which have been explicitly defined by the developer. The third contains a list of all of the hierarchical classes which have been automatically determined from the developer's bracketing. These are called H-Keys, for hierarchical keys, since they represent concepts of hierarchy in the domain. This list can help the developer spot mistakes, such as making typographical errors, or using two different names for the same concept.

The domain editing form also contains a detailed listing of all the classes in the domain. For each class, it tells you whether the class is a key or an action, and allows you to change its designation. It also gives Edit and Delete buttons for each individual class, allowing you to modify them or delete them one at a time.

At the bottom of the list of classes is a text box, with a corresponding Add button and drop-down box which allow you to name a new class, and add it as either a key or an action.

Below the class list is a box where the developer can enter a URL for the domain's CGI-based application. The URL the developer enters into this box will be contacted automatically by SpeechBuilder whenever someone uses the domain. By changing this entry, the developer can use an application located anywhere on the Internet, and even switch applications when necessary.

Below the URL box, SpeechBuilder has a set of five buttons. The first one, Apply Changes, makes any modifications made to the domain permanent. Other action buttons, such as those that add or edit classes, also make other changes to the domain permanent.

The Reduce button takes all of the example sentences given by the developer and simulates running them through the final system, showing a table containing the utterances and the CGI parameters which they would generate, as depicted here. This allows the developer to debug the domain and make sure that all of their examples work as expected. Note that a developer may click on any of the reduced sentences to see the parse tree for that sentence.

The Build button tells SpeechBuilder to build all the necessary internal files to actually run the domain. The Force Build button is slightly more aggressive at clearing and rebuilding the internal files.

The Start and Stop buttons start and stop the actual human language technology servers for that particular domain. The domain which is run is configured as it was the last time the developer clicked the Build button -- any changes since that point, while being retained by SpeechBuilder, are not used in the actual running system. In the current SpeechBuilder setup, the most recent domain to be run is the one the user is connected to when they phone SpeechBuilder, although in the future, we will have several ways of allowing multiple simultaneous SpeechBuilder domains to be run.

Adding and Removing Entries

When the developer picks a particular class to edit, another section of the SpeechBuilder interface appears. The class editor is identical for both keys and actions, and contains two text editing windows. The first window lists all of the entries in the current class. The developer can select one or more existing entries from the list. The second window is initially empty, and allows the developer to add entries. The developer can type one entry per line, and when changes are applied, these entries are made permanent and moved into the existing entry list.

Next to the existing list are a pair of radio buttons, marked edit and delete. If the edit button is selected when the developer applies the changes, all of the selected entries are copied to the second box and removed from the existing list. This allows the developer to modify existing entries without having to retype them. If the delete button is selected, then the chosen entries are simply removed from the class.

XML representation

The concepts and actions specified by the developer are stored in an XML file on our local filesystem. Since the SpeechBuilder utility is a CGI script, the file is modified every time changes are made to the domain. If a developer wishes to edit the XML file themselves, it is possible to download it by selecting that option at the upper left of the SpeechBuilder utility. Similarly, it is possible to upload an XML file into the user's SpeechBuilder directory. Note that an uploaded XML file will completely replace the contents of any existing XML file in that domain, so care should be used when exercising this capability. Also note that although an XML parser is used to check the syntax of any uploaded XML file, a developer should use caution when editing an XML file manually.

Deploying a speech application

As mentioned earlier, once a domain has been built a developer can press the Start button to deploy an application domain. They can then talk to their system by calling the SpeechBuilder developer telephone number and providing their developer code (which is emailed to the developer when they register).

Once a speech application is sufficiently robust, it may be possible to deploy it on a wider scale to the general public via a toll-free number. When a developer feels that their application is at this stage, they should contact us via email to pursue this matter further.

Creating a CGI application

In addition to specifying constraints and example sentences for their domain, the developer needs to create the program which will provide the actual domain-specific interaction to the user. To do this, the developer needs to have access to a CGI-capable web server, and place the script to be used at a URL matching the one specified to SpeechBuilder. Because of the flexibility of CGI, it doesn't matter whether the CGI program is actually a Perl script, a C program pretending to be a web server itself, an Apache module, or any other particular setup, as long as it adheres to the CGI specification. All of our testing to date has been done using Perl and CGI.pm. We provide each developer with a sample application domain when they register, and provide a useful Perl module for parsing the semantic arguments for developers creating their CGI script. Note that each domain is initially set to a SpeechBuilder URL which will echo the text parameter of the incoming CGI arguments.

Processing Input

The first thing the CGI script needs to do is produce valid HTTP headers. Most CGI packages provide the ability to do this easily. The text, action, frame, and history parameters are all passed as individual CGI parameters. The action parameter simply tells which action matched the user's utterance. When SpeechBuilder first receives a new call, it sends the action ###call_answered### so that the application can welcome the user to the domain.
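The steps above can be sketched as follows. The starter kit is written in Perl, but the same logic is easy to show in Python; the function name, greeting text, and fallback replies here are illustrative assumptions, not part of SpeechBuilder itself:

```python
from urllib.parse import parse_qs

def handle_request(query_string):
    """Minimal CGI-style handler sketch: emit valid HTTP headers, then
    branch on the action parameter (names and replies are illustrative)."""
    params = parse_qs(query_string)
    action = params.get("action", ["unknown"])[0]
    headers = "Content-Type: text/plain\r\n\r\n"   # valid HTTP headers first
    if action == "###call_answered###":            # new call: greet the user
        return headers + "Welcome to my application. How can I help you?"
    if action == "unknown":
        return headers + "Sorry, I did not understand. Please try again."
    return headers + "Action received: " + action
```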

The frame parameter contains information on which keys were found in the parsing of the utterance. If the domain is hierarchical, it may have several levels of hierarchy built in. Unless the domain is very simple (perhaps containing one key per utterance), this variable is very difficult to use in its default form. The application will probably want to parse the frame (which has a fairly regular structure) into some internal representation. For our domains, we built a Perl function to parse a frame and return a hash tree containing the structure of the frame. This means that at any level in the hash tree, the keys were all of the key types that were found, and the values were either the value of the key (if it wasn't hierarchical) or another hash table containing the inner context (if it was hierarchical). This makes it very easy to check whether specific keys exist in the frame, or to extract hierarchical information without trouble.
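As an illustration of the kind of parsing involved, here is a minimal recursive frame parser. The kit's actual module is in Perl; this Python sketch assumes only the frame syntax shown in the examples above, and returns nested dicts in place of a Perl hash tree:

```python
def parse_frame(s):
    """Parse a SpeechBuilder-style frame string into a nested dict.
    Hierarchical values (parenthesized sub-frames) become nested dicts."""
    s = s.strip()
    if s.startswith("(") and s.endswith(")"):
        s = s[1:-1]                       # strip the enclosing parentheses
    pairs, cur, depth = [], "", 0
    for ch in s:                          # split on commas at depth 0 only
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            pairs.append(cur)
            cur = ""
        else:
            cur += ch
    if cur.strip():
        pairs.append(cur)
    result = {}
    for pair in pairs:
        key, _, value = pair.strip().partition("=")
        value = value.strip()
        if value.startswith("("):         # hierarchical sub-frame: recurse
            result[key.strip()] = parse_frame(value)
        else:
            result[key.strip()] = value
    return result
```

With this in hand, checking whether a key exists at any level is an ordinary dictionary lookup.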

Generally, the application should first check which action was given. If the action is unknown, then the script can either attempt to get some information out of the keys in the frame, or simply ask the user to try again. After the application has decided which action it is dealing with, it needs to check the appropriate keys. In some cases, the same action can use multiple sets of keys. For instance, in the house domain with the action turn, a room may or may not be present, depending on whether the user said, Turn the lights in the kitchen on, or Turn all the lights on. The script can simply check for the existence of certain keys to determine which form was used, and take the appropriate action.
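A dispatch routine following this pattern might look like the sketch below (Python rather than Perl; the reply strings and exact key names are assumptions based on the house-domain example):

```python
def respond(action, frame):
    """Dispatch on the action first, then check which keys are present.
    The 'turn' action mirrors the house-domain example; key names and
    reply wording are illustrative assumptions."""
    if action == "unknown":
        return "Sorry, I didn't understand that. Please try again."
    if action == "turn":
        value = frame.get("value", "on")
        room = frame.get("room")   # absent for "turn all the lights on"
        if room:
            return "Turning the lights in the " + room + " " + value + "."
        return "Turning all the lights " + value + "."
    return "I don't know how to handle that yet."
```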

Generating Output

To tell SpeechBuilder what to say to the user, the program needs only to print the English-language sentence, which will in turn be taken by the CGI mechanism of the web server and sent to the speech synthesis server running at MIT. This reply can only be one line long, and must end with a carriage return. However, the line can be essentially as long as the developer wants, and can contain multiple sentences. In order to end a call, simply prefix the final response with ###close_off###, and SpeechBuilder will hang up after speaking the last sentence.

To use the history mechanism, the script should simply print a line starting with history=. The line can contain any data the developer wants to remember, and must also end with a carriage return. Further, the history line must occur before the line to be spoken, since everything after that is ignored.
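The output rules above (an optional history= line first, then a single spoken line, optionally prefixed with ###close_off###) can be wrapped in a small helper. A Python sketch, where the function name is an illustrative assumption:

```python
import sys

def reply(text, history=None, end_call=False):
    """Emit a SpeechBuilder response body: the history line, if any, must
    precede the single line to be spoken; prefixing the spoken line with
    ###close_off### tells SpeechBuilder to hang up after speaking it."""
    lines = []
    if history is not None:
        lines.append("history=" + history)
    lines.append(("###close_off###" if end_call else "") + text)
    body = "\n".join(lines) + "\n"    # the spoken line must end the output
    sys.stdout.write(body)
    return body
```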

The way we used the history mechanism in our test scripts was to parallel the frame structure. In addition to a Perl function to parse the frame, we created a function which would take a hash tree structure (like that generated by the frame parser), and produce a single-string history frame. This allows the application to make changes to such a structure to keep track of the current focus (such as who the user asked about last). The script can then encode this into a history string, and when it is received by the next call of the script, decode it back into the structure it started with.
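A serializer that is the inverse of a frame parser can be sketched in a few lines (Python here; dict nesting stands in for the Perl hash tree, and the function name is an assumption):

```python
def build_frame(tree):
    """Serialize a nested dict (like one produced by a frame parser) back
    into a single-line frame string suitable for the history parameter."""
    parts = []
    for key, value in tree.items():
        if isinstance(value, dict):               # nested sub-frame
            parts.append(key + "=" + build_frame(value))
        else:
            parts.append(key + "=" + str(value))
    return "(" + ",".join(parts) + ")"
```

The script can modify the tree to track the current focus, emit it with build_frame on the history= line, and decode it back into the same structure on the next call.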

By doing all this, a script can have a fairly complex interaction with a user, understanding what the user requests, responding appropriately, and keeping track of the course of the conversation as it goes -- all using some very simple mechanisms to interact with the main SpeechBuilder system.

Note that we provide each developer with a sample application domain when they register, and provide a useful module for parsing the semantic arguments for developers creating their CGI script using Perl. This module is included in the SpeechBuilder starter pack. Also included is another sample application (the "flights" domain, which is a simplified version of the Mercury travel information system). The starter pack also includes a very basic application CGI script for this application that illustrates the use of the parsing module.

Future Activities

The current version of SpeechBuilder has focused primarily on robust understanding. One of the next phases of research will be to re-design our discourse component so that it may be used by SpeechBuilder. Although the history parameter provides a simple mechanism for the developer to process frames in context, it makes for extra work in the application program. A separate HLT server which could resolve many local discourse phenomena (as is used for our own domains) would simplify the application processing. The developer interface which will be used to configure the discourse server will revolve around specifying relationships between defined concepts.

Future work will also develop an interface to create mixed-initiative dialogues which can automatically interface with our dialogue module. We have done some initial work in this area and believe we can design a relatively simple interface which will enable more complex interactions than are possible with directed-dialogue graph-based approaches. Finally, we would like to develop an interface for our language generation component, so that we can begin to develop multilingual conversational interfaces with SpeechBuilder without having to modify the application program.

Contacting Us

Comments or questions to bug-galaxy@lists.csail.mit.edu