Medical Text Analysis

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Clinical Decision Making Group

For better or worse, most clinical data accessible to computer processing is still in the form of unstructured natural language.  Although great strides have been made in formalizing the content of medical descriptions, very little is actually stored in formal vocabularies such as SNOMED or ICD-9, with the exception of billing data and (in many places) lab results and pharmacy orders.  Instead, doctors' and nurses' notes, reports of all sorts of tests, referral documents, discharge summaries, plans, and most other documents on which clinical care is based still use "free text" as their representation.  (G. Octo Barnett argues against this phrase, pointing out that such text is actually very costly!)

Recognizing the structure and content of such unstructured medical texts turns out to be a critical need in many of our projects, and forms an active area of research. Several results of this work, in the form of runnable programs, may be of practical use to others:

  1. Extending the lexicon of the Link Grammar Parser with terms from the UMLS SPECIALIST Lexicon.
  2. Finding structure in semi-structured documents, such as clinical notes, discharge summaries, etc.
  3. A Lisp interface to the Link Grammar Parser's API, which we are using in developing a new, dynamic Lisp-based language processing framework that encapsulates various existing tools. (More on this as it becomes commonly usable.)
  4. A script for loading UMLS lexical data into a local MySQL database (a minimal sketch of the idea appears after this list).
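
As a concrete illustration of item 4, here is a minimal sketch in Python that generates the SQL for loading one pipe-delimited UMLS file into MySQL. The file name, table name, and column names below are hypothetical placeholders; UMLS distributes its lexical data as pipe-delimited text, but the actual column layouts are given in the UMLS documentation. This shows the shape of such a script, not the one we distribute.

```python
"""Emit SQL that loads a pipe-delimited UMLS lexical file into MySQL.

A minimal sketch: the file, table, and column names are hypothetical
placeholders, not the actual UMLS schema.
"""

SOURCE_FILE = "LEXICON.RRF"            # placeholder file name
TABLE = "umls_lexicon"                 # placeholder table name
COLUMNS = ["eui", "term", "category"]  # placeholder column names


def make_load_script(source_file, table, columns):
    """Build CREATE TABLE and LOAD DATA statements for one file."""
    col_defs = ",\n  ".join(f"{c} VARCHAR(255)" for c in columns)
    col_list = ", ".join(columns)
    return f"""DROP TABLE IF EXISTS {table};
CREATE TABLE {table} (
  {col_defs}
) CHARACTER SET utf8mb4;

-- UMLS files are pipe-delimited, one record per line.
LOAD DATA LOCAL INFILE '{source_file}'
INTO TABLE {table}
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\\n'
({col_list});
"""


if __name__ == "__main__":
    print(make_load_script(SOURCE_FILE, TABLE, COLUMNS))
```

Piping the output to `mysql --local-infile=1 <database>` performs the load; note that LOCAL INFILE must be enabled on both the MySQL client and server.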

We have also published a number of papers reporting on the development of improved methods for recognizing various kinds of information in unstructured text:

  1. PHI (Protected Health Information), i.e., the items whose disclosure the HIPAA regulations restrict.
  2. Clinically significant content:
    1. Medications
    2. Problems (as in the medical problem list)
    3. Diagnoses
    4. Signs and symptoms
    5. Procedures and other treatments
    6. Modalities, which distinguish, for example, between a test that has been performed and one that is proposed or planned (the sketch after this list illustrates the idea).
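
Modality recognition (item 6 above) may be the least familiar of these categories. The following sketch conveys the idea with a few hand-written trigger phrases; the phrases and the three-way output are illustrative assumptions, not the method of the papers cited below, which rely on richer lexicons and syntactic context.

```python
import re

# Illustrative trigger phrases only; a real modality recognizer would use
# a vetted lexicon and syntactic context, as in the papers cited below.
PLANNED = [r"\bwill undergo\b", r"\bis scheduled for\b",
           r"\bplan(?:ned)? to\b", r"\brecommend(?:ed)?\b", r"\bconsider\b"]
PERFORMED = [r"\bwas performed\b", r"\bunderwent\b", r"\bshowed\b",
             r"\brevealed\b", r"\bwas obtained\b"]


def modality(sentence):
    """Classify a test mention as 'planned', 'performed', or 'unknown'."""
    s = sentence.lower()
    if any(re.search(p, s) for p in PLANNED):
        return "planned"
    if any(re.search(p, s) for p in PERFORMED):
        return "performed"
    return "unknown"


if __name__ == "__main__":
    print(modality("A chest CT was performed and revealed no infiltrate."))
    # -> performed
    print(modality("The patient is scheduled for a colonoscopy."))
    # -> planned
```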

References for this work:

  1. Nakrin A. TagMeds -- A Tool for Populating eXtensible Markup Language Documents with UMLS Concept Unique Identifiers of Current Medications [S.M.]. Cambridge, MA: EECS, MIT; 2001.
  2. Szolovits P. Adding a medical lexicon to an English Parser. AMIA Annu Symp Proc. 2003:639-643.
  3. Bhooshan NR. Classification of Semantic Relations from Syntactic Structures in Medical Text Using the MeSH Hierarchy [M.Eng.]. Cambridge, MA: EECS, MIT; 2005.
  4. Long W. Extracting Diagnoses from Discharge Summaries. AMIA Annu Symp Proc. Washington, DC; 2005.
  5. Sibanda T. Was the Patient Cured? Understanding Semantic Categories and Their Relationships in Patient Records [M.Eng.]. Cambridge, MA: EECS, MIT; 2006.
  6. Sibanda T, Uzuner Ö. Role of Local Context in De-identification of Ungrammatical, Fragmented Text. Proceedings of the North American Chapter of the Association for Computational Linguistics/Human Language Technology (NAACL-HLT 2006). New York, NY; 2006.
  7. Uzuner Ö, Szolovits P, Kohane I, eds. i2b2 Workshop on Natural Language Processing Challenges for Clinical Records. Washington, DC; 2006.
  8. Bramsen P, Deshpande P, Lee YK, Barzilay R. Finding Temporal Order in Discharge Summaries. AMIA Annu Symp Proc. Washington, DC; 2006.
  9. Sibanda T, He T, Szolovits P, Uzuner Ö. A Syntactically-Informed Semantic Category Recognizer for Discharge Summaries. AMIA Annu Symp Proc. Washington, DC; 2006.

In addition, Özlem Uzuner ran the 2006 i2b2 Workshop on Natural Language Processing of Clinical Data in conjunction with the American Medical Informatics Association's Annual Symposium. The workshop focused on two NLP challenge problems, automatic de-identification of discharge summaries and identification of patients' smoking status, and drew 18 team submissions for the two problems.

In preparation for that workshop, we also created a simple web site to demonstrate programs we had developed to help initially de-identify the data that became the corpus for the challenge problems.
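
To give the flavor of such a first pass, here is a minimal pattern-based scrubber of the kind a demonstration site might wrap. The patterns are illustrative assumptions, not the rules that site actually used; serious de-identification requires dictionaries, local context (see the Sibanda and Uzuner papers above), and far broader coverage.

```python
import re

# Illustrative first-pass patterns only; real de-identification needs
# dictionaries, local context, and much broader coverage.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+\b"), "[NAME]"),
]


def scrub(text):
    """Replace obvious PHI patterns with category placeholders."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text


if __name__ == "__main__":
    note = "Seen by Dr. Smith on 3/14/2005; call 617-555-0123 with results."
    print(scrub(note))
    # -> Seen by [NAME] on [DATE]; call [PHONE] with results.
```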


Last updated 11/21/2011, Peter Szolovits.