Medical Text Analysis
MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Clinical Decision Making Group
For better or worse, most clinical data accessible to computer
processing is still in the form of unstructured natural
language. Although great strides have been made in
formalizing the content of medical descriptions, with the
exception of billing data and (in many places) lab results and
pharmacy orders, very little is actually stored in such formal
vocabularies as SNOMED, ICD9, etc. Instead, doctors' and
nurses' notes, reports of all sorts of tests, referral documents,
discharge summaries, plans, and most other documents on which
clinical care is based still use "free text" as their
representation. (G. Octo Barnett argues against this phrase,
pointing out that it is actually very costly!)
Recognizing the structure and content of such unstructured
medical texts turns out to be a critical need in many of our
projects, and forms an active area of research. The following
results of this work, available as runnable programs, may be of
practical use to others:
- Extending the lexicon of the Link
Grammar Parser with terms from the UMLS' Specialist
lexicon.
- Finding structure in
semi-structured documents, such as clinical notes,
discharge summaries, etc.
- A Lisp interface to the Link
Grammar Parser's API, which we are using in developing a
new, dynamic Lisp-based language processing framework that
encapsulates various existing tools. (More on this as it becomes
commonly usable.)
- A script to load UMLS lexical data into a local MySQL
database.
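As a rough illustration of what "finding structure" in a semi-structured note can mean, the sketch below splits a note into sections wherever a line begins with a short header followed by a colon. It is not the group's program: the header pattern, the `split_sections` function, and the sample note are all assumptions made up for this example, and real clinical notes would need a much richer notion of a header (for instance, a narrative line such as "Patient states: ..." would be mistaken for one here).

```python
import re
from typing import List, Tuple

# Assumed header convention (hypothetical): a line that starts with a
# capitalized phrase of letters, spaces, or slashes, followed by a colon.
HEADER = re.compile(r"^([A-Z][A-Za-z /]{1,40}):\s*(.*)$")


def split_sections(note: str) -> List[Tuple[str, str]]:
    """Return (header, body) pairs in the order they appear in the note."""
    sections: List[Tuple[str, str]] = []
    header = "PREAMBLE"  # label for any text before the first header
    body: List[str] = []
    for line in note.splitlines():
        match = HEADER.match(line)
        if match:
            # Close the previous section, skipping an empty preamble.
            if body or header != "PREAMBLE":
                sections.append((header, " ".join(body)))
            header = match.group(1).strip().upper()
            rest = match.group(2).strip()
            body = [rest] if rest else []
        elif line.strip():
            body.append(line.strip())
    # Close the final section, again skipping an empty preamble.
    if body or header != "PREAMBLE":
        sections.append((header, " ".join(body)))
    return sections


# Invented sample text, not real patient data.
SAMPLE_NOTE = """Chief Complaint: chest pain
History of Present Illness:
The patient is a 54 year old man
with two days of chest pain.
Medications: aspirin 81 mg daily
"""

if __name__ == "__main__":
    for name, text in split_sections(SAMPLE_NOTE):
        print(f"[{name}] {text}")
```

Normalizing headers to upper case is only a convenience for matching variants such as "Medications" and "MEDICATIONS"; a practical tool would more likely map headers onto a curated list of known section names.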
We have also published a number of papers reporting on
development of improved methods for recognizing various kinds of
information in unstructured text:
- PHI (Protected Health Information), such as those personally
identifying items whose disclosure the HIPAA regulations prohibit.
- Clinically significant content:
  - Medications
  - Problems (as in the medical problem list)
  - Diagnoses
  - Signs and symptoms
  - Procedures and other treatments
- Modalities, which distinguish, for example, between a test
that has been performed and one that is proposed or planned.
These are references to the papers:
- Nakrin A. TagMeds -- A Tool for Populating eXtensible Markup
Language Documents with UMLS Concept Unique Identifiers of
Current Medications [S.M.]. Cambridge, MA: EECS, MIT; 2001.
- Szolovits P. Adding a medical
lexicon to an English Parser. AMIA Annu Symp Proc.
2003:639-643.
- Bhooshan NR. Classification
of Semantic Relations from Syntactic Structures in Medical
Text Using the MeSH Hierarchy [M.Eng.]. Cambridge, MA:
EECS, MIT; 2005.
- Long W. Extracting
Diagnoses from Discharge Summaries. Symposium of the
American Medical Informatics Association, 2005. Washington, DC;
2005.
- Sibanda T. Was
the Patient Cured? Understanding Semantic Categories and Their
Relationships in Patient Records [M.Eng.]. Cambridge, MA:
EECS, MIT; 2006.
- Sibanda T, Uzuner Ö. Role
of Local Context in De-identification of Ungrammatical,
Fragmented Text. Proceedings of the North American Chapter
of the Association for Computational Linguistics/Human Language
Technology (NAACL-HLT 2006). New York, NY; 2006.
- Uzuner Ö, Szolovits P, Kohane I, eds. i2b2
Workshop on Natural Language Processing Challenges for
Clinical Records. Washington, DC; 2006.
- Bramsen P, Deshpande P, Lee YK, Barzilay R. Finding
Temporal Order in Discharge Summaries. Symposium of the
American Medical Informatics Association, 2006. Washington, DC;
2006.
- Sibanda T, He T, Szolovits P, Uzuner Ö. A
Syntactically-Informed Semantic Category Recognizer for
Discharge Summaries. Proceedings of the Fall Symposium of
the American Medical Informatics Association; Washington, DC,
November 11-15, 2006.
In addition, Özlem Uzuner ran the 2006 i2b2 Workshop
on Natural Language Processing of Clinical Data in
conjunction with the American Medical Informatics Association's
Annual Symposium. The workshop focused on two NLP challenges and
drew 18 team submissions for the two problems:
- De-identification of unstructured clinical text
- Determination of the smoking status of a patient from clinical
discharge summaries
In preparation for that workshop, we also created a simple web site to demonstrate
the programs we had developed to help with the initial de-identification of the data
that became the corpus for the challenge problems.
Last updated 11/21/2011, Peter Szolovits.