Extending the Link Grammar Parser's lexicon from UMLS' Specialist lexicon

At the AMIA 2003 Annual Symposium, I presented a paper that addresses one small but important part of the long-term attack on the problem of using unstructured text.  To enable general-purpose language processing tools to manipulate medical text, we must augment their typically non-technical vocabularies with a large medical lexicon.  The paper presents a heuristic method for translating lexical information from one lexicon to another, and applies it to import lexical definitions of about 200,000 word senses from the UMLS's Specialist lexicon into the lexicon of the Link Grammar Parser.

Szolovits, P.  Adding a Medical Lexicon to an English Parser.  Proc. AMIA 2003 Annual Symposium, pages 639-643, 2003.

That paper links to this Web page for a few further details (below) for which there was no space in the paper, and as a source of downloadable definitions of the medical terms to be used in the Link Grammar Parser.

Using the Medical Vocabulary with the Link Grammar Parser

  1. Download a copy of the Link Grammar Parser.  Minimal instructions are given at its home site.  Distributions seem to exist for Windows and Unix.  Because source code is available, you could of course roll your own for any operating system.
  2. Expand what you got in step 1 into a local directory/folder structure.  In my installation, data files that define the dictionary and other processing rules wind up in a data subdirectory.
  3. Download one of the following .zip files containing additions to the Link Grammar Parser's dictionary:
    The 200,000-word and phrase vocabulary described in the paper, above.
    The 125,000-word vocabulary that remains if we eliminate multi-word phrases (those containing spaces, hyphens, or commas) from the previous file.  As mentioned in the paper, many of the phrases can be handled appropriately by applying normal grammar rules to their components.  At least some idiosyncratic phrases will not parse correctly, however.
  4. Expand what you got in step 3.
  5. The contents of the file extra.dict need to be appended to the end of the file 4.0.dict that you got in step 2.
  6. All the files with names of the form extra.n need to be moved to the data subdirectory you found in step 2 (not into that directory's words subdirectory!).
  7. If you now start the Link Parser with default arguments, it should load the complete dictionary, including all the medical additions.
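The file manipulations in steps 4-6 can be sketched as follows.  This is only a runnable illustration, not part of the distribution: it fabricates stand-in files first (in a real installation, 4.0.dict comes from step 2 and the extra.* files from step 4), and all contents written below are invented for the example.

```python
# Sketch of steps 4-6, assuming the parser was unpacked into
# link-grammar/ (with its data/ subdirectory) and the medical
# additions into lgp_medical/.  The setup below fabricates stand-in
# files so the sketch is self-contained.
import pathlib
import shutil

data = pathlib.Path("link-grammar/data")
extras = pathlib.Path("lgp_medical")
data.mkdir(parents=True, exist_ok=True)
extras.mkdir(exist_ok=True)
(data / "4.0.dict").write_text("% stand-in for the original 4.0.dict\n")
(extras / "extra.dict").write_text("% stand-in for the medical additions\n")
(extras / "extra.1").write_text("% stand-in word list\n")

# Step 5: append the contents of extra.dict to the end of 4.0.dict.
with (data / "4.0.dict").open("a") as f:
    f.write((extras / "extra.dict").read_text())

# Step 6: move the extra.n files into data/ (not into data/words/).
for path in sorted(extras.glob("extra.[0-9]*")):
    shutil.move(str(path), str(data / path.name))
```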

Please let me know if you find the medical vocabulary useful, and if you encounter serious problems due to the vocabulary additions.  (If you have difficulties with the parser itself, I am unable to help.)

2011 Update

In the intervening years since this work was first done, maintenance of the Link Grammar Parser has moved from CMU to the group that develops the AbiWord word processing program. The latest version of the Link Grammar Parser may be downloaded here. Because the less technical parts of the medical vocabulary that I created in 2003 have now been incorporated into the standard dictionary of the Link Grammar Parser, I have revised my distribution to work with that parser as of July 2011 (version 4.7.4). To use the parser with my extensions, do the following:
  1. Download the parser from here. You will need to follow the instructions on how to build and install the parser. What normally works for me is to get a Unix-like shell, go to the directory that holds the parser, and invoke the following:
      sudo make install
    You may need to adjust this depending on the system you are using. Note that although the directory in which you did the make will contain a runnable version of the parser and its data subdirectory will contain its dictionaries, when the program runs, it uses dictionaries at the place where make install copied them, not the ones packaged with the sources!
  2. Download a .zip file of my extensions from here, and unzip it. That will produce a directory named lgp_extras_2011. Follow the directions in the comments at the beginning of the file extra.dict. Remember to append that file to the 4.0.dict file that was installed, not the one that is in the source distribution. The same goes for where to place the extra.n files.
  3. If all goes well, you should see no error messages when starting link-parser. If errors appear, check the installation. If you are running a later version of the parser, errors may arise because its dictionary already defines, or conflicts with, some of the added words.

    Right to Use

    I have no authority to grant rights to use either the Link Grammar Parser or any content of the UMLS.  Please contact the appropriate organizations (linked above) for ways to seek permission.  I am happy to grant anyone permission to use whatever contribution I have made to the above, and request only that the paper be cited.  I certainly had access to and used the Specialist lexicon in preparing the augmentation of the Link Grammar Parser's vocabulary.  However, the downloadable files I am providing here do not contain any of the literal content of Specialist except for the words and phrases defined in that lexicon.  (Though the Specialist definitions were used in the translation process, all the words and phrases are now defined in the Link Grammar.)  I therefore do not know what the legal relationship is between my augmentation to the Link Grammar and NLM's requirement that anyone who uses components of the UMLS must sign a license agreement to do so.

    --Peter Szolovits

    Details in Mapping Specialist Lexicon Terms to the Link Grammar Parser

    The paper referenced above describes the methods we have developed to let us augment the lexicon of the Link Parser with medical terms drawn from the Specialist lexicon.  That venue left no room to discuss a few technical details of the mapping process.  For the sake of completeness, they are documented here.

    Limited Length of Phrases

    The Link Parser allows a phrase to contain a maximum of (I believe) 59 characters.  Forty of the Specialist terms are longer than this, and therefore cannot be represented.  These include examples such as:

    "American College of Osteopathic Obstetricians and Gynecologists"
    "Autographa californica multicapsid nuclear polyhedrosis virus gp64"
    "Fellow of the American College of Obstetricians and Gynecologists"
    "human t-cell leukemia virus iii lymphadenopathy-associated virus antigen"
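    A simple filter identifies such terms; this is an illustrative sketch, and MAX_LEN reflects the 59-character limit believed above, not a documented constant of the parser.

```python
# Drop candidate phrases that exceed the Link Parser's phrase-length
# limit (believed to be 59 characters).
MAX_LEN = 59

candidates = [
    "American College of Osteopathic Obstetricians and Gynecologists",
    "cross matching",
]

too_long = [p for p in candidates if len(p) > MAX_LEN]
representable = [p for p in candidates if len(p) <= MAX_LEN]
```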

    Marked and Unmarked Senses of Phrases and Words

    A more common problem with LP's representation of phrases is that they cannot be marked (as words can) with a suffix indicating the phrase's part of speech.  Thus, in contrast to normal words, where LP can distinguish between the verb "running.v" and the gerund "running.g", we cannot represent the fact that "cross matching" can also be either a verb or gerund, because we are not allowed to form "cross matching.v" or "cross matching.g".  This problem arises for 341 phrases.  In such cases, we represent only one of the possible phrase senses, without its suffix, in order of preference: a--adjective, n--noun, v--verb, g--gerund.
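    The preference rule can be sketched as follows; the function name and the set representation of senses are illustrative, not taken from the actual mapping code.

```python
# Choose the single phrase sense to keep, in the order of preference
# stated above: a (adjective), n (noun), v (verb), g (gerund).
PREFERENCE = ["a", "n", "v", "g"]

def pick_phrase_sense(senses):
    """Return the most preferred part of speech present in `senses`."""
    for pos in PREFERENCE:
        if pos in senses:
            return pos
    return None
```

For "cross matching", whose verb and gerund senses cannot both be represented, pick_phrase_sense({"v", "g"}) keeps only the verb sense.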

    A similar problem arises for both words and phrases: the LP lexicon cannot accept definitions both for a word unmarked with a part of speech and for the same word marked with one of the acceptable suffixes.  (Multiple instances of a word or phrase, each marked with a part of speech suffix, are acceptable.)  Because the lexical transfer process does not always yield a part of speech for the word sense being transferred, we may wind up with new entries for a word unmarked with its part of speech as well as existing or new entries marked with one.  Because such combinations yield LP grammars that cannot be properly read into the LP parser, we must suppress them.  If the same word would be defined in the augmented LP lexicon with two word senses, one with a known part of speech and the other without, we suppress the entry without a known part of speech.  This occurs 413 times in mapping from Specialist to LP.  Unfortunately, the suppressed entry may still correspond to a legitimate word sense for LP, but it could only be represented by inventing a new part of speech suffix.  That is a move I have not yet been willing to make, though it is a possible solution to this problem.
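    A sketch of the suppression rule, using illustrative names; LP dictionary entries are written here simply as strings like "running.v" or "running".

```python
def suppress_unmarked(entries):
    """Drop any unsuffixed entry whose word also occurs with a
    part-of-speech suffix, since LP cannot load a dictionary that
    defines both for the same word."""
    marked_bases = {e.rsplit(".", 1)[0] for e in entries if "." in e}
    return [e for e in entries if "." in e or e not in marked_bases]
```

For example, suppress_unmarked(["running", "running.v", "walking"]) drops the bare "running" but keeps "walking", which has no marked counterpart.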

    Caveat in Identifying Word Senses in both Source and Target Lexicons

    The mapping formalism demands that, given a word sense in the source lexicon, we can find the "same" word sense in the target lexicon if it exists.  The paper notes that in LP, some words are not annotated with their part of speech, and the complex set of feature formulae used by LP makes it impossible to derive the part of speech from the formula.  In these cases, we simply assume that if the word is the same in LP as in Specialist, the LP word should match the word sense from Specialist.  This assumption may generate too many possible definitions for a word we are trying to map.  In practice, fortunately, the indiscernible sets of word senses in the target vocabulary that carry such an extraneous definition tend to have a small intersection with the word set we start with from Specialist, and therefore they do not lead to mapping errors.  The problem can be much more significant when the indiscernible sets are very small, and may account for some of the hand-corrected mapping errors described in the paper.
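    The matching assumption can be sketched as follows; the function name and the list representation of the LP dictionary are illustrative.

```python
def candidate_lp_matches(spec_word, spec_pos, lp_entries):
    """Find LP entries that may express the given Specialist word sense.

    lp_entries holds strings like "running.v", or bare words like
    "after" when LP supplies no part of speech.  An unannotated LP
    entry is assumed to match on the word alone, as described above.
    """
    matches = []
    for entry in lp_entries:
        if "." in entry:
            word, pos = entry.rsplit(".", 1)
            if word == spec_word and pos == spec_pos:
                matches.append(entry)
        elif entry == spec_word:
            matches.append(entry)  # no POS in LP: assume a match
    return matches
```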

    Should Mapping Apply to Words or Word Senses?

    Many words in any natural language are considered lexically ambiguous.  Sometimes such ambiguity is systematic in the language.  For example, "running" is marked in Specialist as either a verb present participle or as a third-person count or uncount noun.  LP's analysis makes it either a gerund or a verb.  For other words, the set of lexically ambiguous interpretations seems more idiosyncratic.  For example, "after" is considered a positive adjective, an ordinary preposition, or a conjunction in Specialist.  What should we take as the "word" to be mapped from one lexicon to another?  Should it be each individual lexical version of the word, or the combination of lexical descriptors?

    We were initially tempted by the second approach, but it suffers from problems that make it unsustainable.  The approach assumes that when lexical ambiguity arises from systematic relationships, words that share the same (ambiguous) lexical descriptors should in fact be treated similarly; i.e., they should form an indiscernible set.  For example, the noun interpretation of "running" is indiscernible from 25,262 other nouns and the verb interpretation from 8,208 verbs, but only 86 words (e.g., "C-banding", "autoclaving", "flooding", …) have the identical combination of lexical descriptors as "running" in Specialist.  Unfortunately, the LP lexical descriptions of the 71 (of these 86) words known to LP differ greatly from each other.  For example, although "running" has the gerund and verb interpretations in LP, "undertaking" has an additional noun interpretation, and "understanding" has additional noun and adjective interpretations.  Even where interpretations in LP are roughly the same, the detailed formulae describing each word differ.  Thus, LP makes subtler distinctions between "running" and "underpinning" than Specialist does.  As a result, when one tries to map across an appropriate LP formula for "autoclaving" (a word unknown to LP but present in Specialist and indiscernible from each of these words), this approach fails to tell us which of the possible mapped formulae to use.

    In general, the more detailed a description is, the fewer things fit it.  This is certainly true of lexical descriptions.  As we argue here, a description that jointly covers several different word senses of a single word will therefore have a relatively small set of indiscernible words associated with it.  Our mapping process seems to work most reliably when we can identify large indiscernible sets.  Forcing mapping to apply to words, however, leads to small sets.  Further, we see evidence that differences in design between the Specialist and LP lexicons show up more clearly when we take all senses of a word together rather than separately.

    The approach we have therefore adopted is to treat separately every lexical meaning of a word that comes from a different syntactic category (part of speech) in the source vocabulary.  This leads to much larger indiscernible sets (recall the case of "running", above) and makes it easier to choose the LP formula to map to an unknown word such as "autoclaving".  If a word unknown in LP has multiple known senses in Specialist, we decouple the mapping process for these senses into separate, simpler parts.
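    The difference between the two groupings can be illustrated with a toy lexicon.  The words and descriptor sets below are invented for the example; the real Specialist counts are the ones quoted above.

```python
from collections import defaultdict

# Toy source lexicon: word -> set of lexical descriptors.
specialist = {
    "running":     {"verb", "noun"},
    "autoclaving": {"verb", "noun"},
    "flooding":    {"verb", "noun"},
    "aspirin":     {"noun"},
    "after":       {"adj", "prep", "conj"},
}

# Per-word grouping: indiscernible sets keyed on the whole
# combination of descriptors (the approach we rejected).
by_combination = defaultdict(set)
for word, descriptors in specialist.items():
    by_combination[frozenset(descriptors)].add(word)

# Per-sense grouping: each part of speech treated separately
# (the approach we adopted), yielding larger indiscernible sets.
by_sense = defaultdict(set)
for word, descriptors in specialist.items():
    for d in descriptors:
        by_sense[d].add(word)
```

Even in this tiny example, the per-sense noun group is larger than any per-combination group, mirroring the contrast between the 25,262-noun set and the 86-word set for "running".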

    Last updated 07/08/2011, Peter Szolovits. (And in 2008 to fix punctuation.)