In a paper at the AMIA 2003 Annual Symposium, I address one small but important part of the long-term attack on the problem of using unstructured text. To enable general-purpose language-processing tools to manipulate medical text, we must augment their typically non-technical vocabularies with a large medical lexicon. The paper presents a heuristic method for translating lexical information from one lexicon to another, and applies it to import lexical definitions of about 200,000 word senses from the UMLS Specialist lexicon into the lexicon of the Link Grammar Parser.
Szolovits, P. Adding a Medical Lexicon to an English Parser. Proc. AMIA 2003 Annual Symposium. Pages 639-643. 2003.
That paper links to this Web page for a few further details (below) for which there was no space in the paper, and as a source of downloadable definitions of the medical terms to be used in the Link Grammar Parser.
Please let me know if you find the medical vocabulary useful, and if you encounter serious problems due to the vocabulary additions. (If you have difficulties with the parser itself, I am unable to help.)
To build and install the parser:

    ./configure
    make
    sudo make install

You may need to adjust this depending on the system you are using. Note that although the directory in which you ran make will contain a runnable version of the parser, and its data subdirectory will contain its dictionaries, when the program runs it uses the dictionaries at the place where make install copied them, not the ones packaged with the sources!
The medical additions are packaged in lgp_extras_2011. Follow the directions in the comments at the beginning of the file extra.dict. Remember to append that file to the 4.0.dict file that was installed, not the one that is in the source distribution. The same goes for where to place the
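For concreteness, the append step might look like the following. The data-directory path is an assumption (it depends on the prefix you gave configure), so check where make install actually put the dictionaries on your system:

```shell
# Assumed install location -- verify on your own system before running.
LG_DATA=/usr/local/share/link-grammar
cat extra.dict >> "$LG_DATA/4.0.dict"
```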
I have no authority to grant rights to use either the Link Grammar Parser or any content of the UMLS. Please see the appropriate organizations (linked above) for ways to seek permission. I am happy to grant anyone permission to use whatever contribution I have made to the above, and request only that the paper be cited. I definitely had access to and used the Specialist lexicon in my work preparing the augmentation of the Link Grammar Parser's vocabulary. However, the downloadable files I am providing here do not contain any of the literal content of Specialist except for the words and phrases defined in that lexicon. (Though the Specialist definitions were used in the translation process, all the words and phrases are now defined in the Link Grammar.) I therefore do not know what the legal relationship is between my augmentation to the Link Grammar and NLM's requirement that anyone who uses components of the UMLS must sign a license agreement to do so.
The paper referenced above describes the methods we have developed to let us augment the lexicon of the Link Parser with medical terms drawn from the Specialist lexicon. That venue left no room to discuss a few technical details of the mapping process. For the sake of completeness, they are documented here.
The Link Parser allows a phrase to contain a maximum of (I believe) 59 characters. Forty of the Specialist terms are longer than this, and therefore cannot be represented. These include examples such as:
"American College of Osteopathic Obstetricians and Gynecologists"
"Autographa californica multicapsid nuclear polyhedrosis virus gp64"
"Fellow of the American College of Obstetricians and Gynecologists"
"human t-cell leukemia virus iii lymphadenopathy-associated virus antigen"
A more common problem with LP's representation of phrases is that they cannot be marked (as words can) with a suffix indicating the phrase's part of speech. Thus, in contrast to normal words, where LP can distinguish between the verb "running.v" and the gerund "running.g", we cannot represent the fact that "cross matching" can also be either a verb or gerund because we are not allowed to form "cross matching.v" or "cross matching.g". This problem arises for 341 phrases. In this case, we represent only one of the possible phrase senses, without its suffix, in order of preference: a--adjective, n--noun, v--verb, g--gerund.
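The preference rule above can be sketched like this; the representation of a sense set is a hypothetical simplification, but the order a > n > v > g is the one stated in the text:

```python
# Sketch of the sense-preference rule for phrases, which cannot carry a
# part-of-speech suffix in LP: keep only the most-preferred sense and
# write the phrase without a suffix.
PREFERENCE = ["a", "n", "v", "g"]  # adjective, noun, verb, gerund

def pick_phrase_sense(senses):
    """Given the POS tags attested for a phrase, return the one to keep."""
    for pos in PREFERENCE:
        if pos in senses:
            return pos
    return None

# "cross matching" is attested as both a verb and a gerund; only the
# verb sense survives.
pick_phrase_sense({"v", "g"})  # → "v"
```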
A similar problem, arising for both words and phrases in LP, is that its lexicon cannot accept lexical definitions for both a word unmarked with a part of speech and the same word marked with one of the acceptable suffixes. (Multiple instances of a word or phrase that are each marked with a part-of-speech suffix are acceptable.) Because the lexical transfer process does not always yield a part of speech for the word sense being transferred, we may wind up with new entries for a word unmarked with its part of speech as well as either existing or new entries marked with the part of speech. Because such a combination yields an LP grammar that cannot be properly read into the parser, we must suppress one of the entries. If the same word is to be defined in the augmented LP lexicon with two word senses, one with a known part of speech and the other without, we suppress the entry without a known part of speech. This occurs 413 times in mapping from Specialist to LP. Unfortunately, the suppressed entry may still correspond to a legitimate word sense for LP, but it could only be represented by inventing a new part-of-speech suffix, a move I have not yet been willing to make, though it remains a possible solution to this problem.
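The suppression rule can be sketched as follows; the entry names and the function are hypothetical illustrations, not the actual transfer code:

```python
# Sketch of the conflict rule: LP cannot load a lexicon containing both
# "word" (no POS suffix) and "word.x" (with a suffix), so the unmarked
# entry must be suppressed whenever a marked entry for the same word exists.
def suppress_unmarked(entries):
    """entries: set of lexicon entry names like 'matching' or 'matching.v'."""
    marked_bases = {e.rsplit(".", 1)[0] for e in entries if "." in e}
    return {e for e in entries if "." in e or e not in marked_bases}

suppress_unmarked({"matching", "matching.v", "autoclaving"})
# keeps {"matching.v", "autoclaving"}; the bare "matching" is suppressed
```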
The mapping formalism demands that, given a word sense in the source lexicon, we can find the "same" word sense in the target lexicon if it exists. The paper notes that in LP some words are not annotated with their part of speech, and the complex feature formulae used by LP make it impossible to derive the part of speech from the formula. In these cases, we simply assume that if the word is the same in LP as in Specialist, the LP word should match the word sense from Specialist. This assumption may generate too many possible definitions for a word we are trying to map. In practice, fortunately, the indiscernible sets of word senses in the target vocabulary that have this extraneous definition tend to have a small intersection with the word set we start with from Specialist, and therefore they do not lead to mapping errors. The problem can be much more significant when the indiscernible sets are very small, and it may account for some of the hand-corrected mapping errors described in the paper.
Many words in any natural language are considered lexically ambiguous. Sometimes such ambiguity is systematic in the language. For example, "running" is marked in Specialist as either a verb present participle or as a third-person count or uncount noun. LP's analysis makes it either a gerund or a verb. For other words, the set of lexically ambiguous interpretations seems more idiosyncratic. For example, "after" is considered a positive adjective, an ordinary preposition, or a conjunction in Specialist. What should we take as the "word" to be mapped from one lexicon to another? Should it be each individual lexical version of the word, or the combination of lexical descriptors?
We were initially tempted by the second approach, but we will describe problems that make it unsustainable. The approach assumes that when lexical ambiguity arises from systematic relationships, words that share the same (ambiguous) lexical descriptors should in fact be treated similarly; i.e., they should form an indiscernible set. For example, the noun interpretation of "running" is indiscernible from 25,262 other nouns and the verb interpretation from 8,208 verbs, but only 86 words (e.g., "C-banding", "autoclaving", "flooding") have the identical combination of lexical descriptors as "running" in Specialist. Unfortunately, the LP lexical descriptions of the 71 of these 86 words that are known to LP differ from each other greatly. For example, although "running" has the gerund and verb interpretations in LP, "undertaking" has an additional noun interpretation, and "understanding" has additional noun and adjective interpretations. Even where interpretations in LP are roughly the same, the detailed formulae describing each word differ. Thus, LP makes subtler distinctions between "running" and "underpinning" than Specialist does. As a result, when one tries to map across an appropriate LP formula for "autoclaving" (a word actually unknown to LP but in Specialist and indiscernible from each of these words), this approach fails to tell us which of the possible mapped formulae to use.
In general, the more detailed a description is, the fewer things fit it. This is certainly true of lexical descriptions. As we argue here, a description that jointly covers several different word senses of a single word will therefore have a relatively small set of indiscernible words associated with it. Our mapping process seems to work most reliably when we can identify large indiscernible sets; forcing the mapping to apply to whole words, however, leads to small sets. Further, we see evidence that differences in design between the Specialist and LP lexicons show up more clearly when we take all senses of a word together rather than separately.
The approach we have adopted, therefore, is to treat separately every lexical meaning of a word that comes from a different syntactic category (part of speech) in the source vocabulary. This leads to much larger indiscernible sets (recall the case of "running", above), yet makes it easier to choose the LP formula to map to an unknown word such as "autoclaving". If a word unknown to LP has multiple known senses in Specialist, we decouple the mapping process for these senses into separate, simpler parts.
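Under simplifying assumptions (a lexicon represented as a dict from word to a list of (POS, formula) pairs, with placeholder formula strings), the per-sense grouping into indiscernible sets can be sketched as:

```python
from collections import defaultdict

# Sketch of forming indiscernible sets one (POS, formula) sense at a time:
# words grouped per sense yield larger sets than grouping by a word's
# whole combination of senses, which is the point made above.
def indiscernible_sets(lexicon):
    groups = defaultdict(set)
    for word, senses in lexicon.items():
        for pos, formula in senses:
            groups[(pos, formula)].add(word)
    return groups

lex = {
    "running":     [("g", "F1"), ("v", "F2")],
    "flooding":    [("g", "F1"), ("v", "F2")],
    "undertaking": [("g", "F1"), ("v", "F2"), ("n", "F3")],
}
groups = indiscernible_sets(lex)
# All three words share the gerund set, even though "undertaking" has an
# extra noun sense; grouping by whole-word descriptor combinations would
# separate it.
```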