Medical Text Analysis
MIT
Computer Science and Artificial Intelligence Laboratory (CSAIL)
Clinical Decision Making Group
For better or worse, most clinical data accessible to computer processing is
still in the form of unstructured natural language. Although great strides
have been made in formalizing the content of medical descriptions, with the
exception of billing data and (in many places) lab results and pharmacy orders,
very little is actually stored in such formal vocabularies as SNOMED, ICD-9, etc.
Instead, doctors' and nurses' notes, reports of all sorts of tests, referral
documents, discharge summaries, plans, and most other documents on which
clinical care is based still use "free text" as their representation. (G.
Octo Barnett argues against this phrase, pointing out that it is actually very
costly!)
Recognizing the structure and content of such unstructured medical texts turns out to be a critical need in many of our projects, and forms an active area of research. Results of this work, in the form of runnable programs that may be of practical use to others, include:
- Extending the lexicon of the Link Grammar Parser with terms from the UMLS SPECIALIST Lexicon.
- Finding structure in semi-structured documents, such as clinical notes, discharge summaries, etc.
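As a rough illustration of what finding structure in a semi-structured document can involve, here is a minimal sketch that splits a note into labeled sections. It is not the program listed above. It assumes, purely for illustration, that a section header is a short run of capital letters ending in a colon; the names `HEADER`, `split_sections`, and `sample` are hypothetical, and the sample note is invented.

```python
import re

# Assumption for this sketch: a header is a short line fragment in capital
# letters (spaces, slashes, and ampersands allowed) followed by a colon.
HEADER = re.compile(r"^([A-Z][A-Z /&]{2,40}):\s*(.*)$")


def split_sections(note):
    """Return a list of (header, body) pairs in document order."""
    sections = []
    header, body = "PREAMBLE", []
    for raw in note.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # Close the previous section, skipping an empty preamble.
            if body or header != "PREAMBLE":
                sections.append((header, " ".join(body)))
            header = match.group(1).strip()
            # Text on the header line itself starts the new body.
            body = [match.group(2)] if match.group(2) else []
        elif line:
            body.append(line)
    sections.append((header, " ".join(body)))
    return sections


# Invented example text, not real patient data.
sample = """ADMISSION DIAGNOSIS: Chest pain.
HISTORY OF PRESENT ILLNESS:
The patient is a 62-year-old man
with two days of chest pain.
DISCHARGE MEDICATIONS: Aspirin 81 mg daily."""

for name, text in split_sections(sample):
    print(name, "->", text)
```

Real notes vary widely in how headers are written, so a fixed pattern like this would likely need a site-specific list of header names or a learned model to be useful in practice.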
We have also published a number of papers reporting on development of improved methods for recognizing various kinds of information in unstructured text:
- Protected health information (PHI), such as the personally identifying items whose disclosure the HIPAA regulations prohibit.
- Clinically significant content:
  - Medications
  - Problems (as in the medical problem list)
  - Diagnoses
  - Signs and symptoms
  - Procedures and other treatments
- Modalities, which distinguish, for example, between a test that has been performed and one that is proposed or planned.
These are references to the papers:
- Nakrin A. TagMeds -- A Tool for Populating eXtensible Markup Language Documents with UMLS Concept Unique Identifiers of Current Medications [S.M.]. Cambridge, MA: EECS, MIT; 2001.
- Szolovits P. Adding a medical lexicon to an English Parser. AMIA Annu Symp Proc. 2003:639-643.
- Bhooshan NR. Classification of Semantic Relations from Syntactic Structures in Medical Text Using the MeSH Hierarchy [M.Eng.]. Cambridge, MA: EECS, MIT; 2005.
- Long W. Extracting Diagnoses from Discharge Summaries. Symposium of the American Medical Informatics Association, 2005. Washington, DC; 2005.
- Sibanda T. Was the Patient Cured? Understanding Semantic Categories and Their Relationships in Patient Records [M.Eng.]. Cambridge, MA: EECS, MIT; 2006.
- Sibanda T, Uzuner Ö. Role of Local Context in De-identification of Ungrammatical, Fragmented Text. Proceedings of the North American Chapter of Association for Computational Linguistics/Human Language Technology (NAACL-HLT 2006). New York, NY; 2006.
- Uzuner Ö, Szolovits P, Kohane I, eds. i2b2 Workshop on Natural Language Processing Challenges for Clinical Records. Washington, DC; 2006.
- Bramsen P, Deshpande P, Lee YK, Barzilay R. Finding Temporal Order in Discharge Summaries. Symposium of the American Medical Informatics Association, 2006. Washington, DC; 2006.
- Sibanda T, He T, Szolovits P, Uzuner Ö. Syntactically-Informed Semantic Category Recognizer for Discharge Summaries. Proceedings of the Fall Symposium of the American Medical Informatics Association. Washington, DC; November 11-15, 2006.
In addition, Özlem Uzuner ran the 2006 i2b2 Workshop on Natural Language Processing of Clinical Data in conjunction with the American Medical Informatics Association's Annual Symposium. The workshop centered on two NLP challenge problems, which together drew 18 team submissions:
- De-identification of unstructured clinical text
- Determination of the smoking status of a patient from clinical discharge summaries
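To make the first of these tasks concrete, the following is a minimal sketch of pattern-based scrubbing. It is a deliberately simple stand-in, not the method used in the challenge or in the papers above. The patterns, the replacement tags, and the names `PATTERNS` and `scrub` are assumptions chosen for illustration, and the example sentence is invented.

```python
import re

# Hypothetical patterns for three kinds of identifiers. Real de-identification
# must also handle names, addresses, ages, and many other formats.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("ID", re.compile(r"\bMRN:?\s*\d+\b")),
]


def scrub(text):
    """Replace each matched identifier with a bracketed category tag."""
    for label, pattern in PATTERNS:
        text = pattern.sub("[" + label + "]", text)
    return text


# Invented example text, not real patient data.
print(scrub("Seen on 11/30/2006, MRN: 4471823, call 617-555-0123."))
# prints: Seen on [DATE], [ID], call [PHONE].
```

Fixed patterns like these miss identifiers that have no regular surface form, such as personal names that are also ordinary words, which is one reason the task remains a research problem.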
In preparation for that workshop, we also created a simple web site to demonstrate programs we had developed to help initially de-identify the data that became the corpus for the challenge problems.
Last updated 11/30/2006, Peter Szolovits.