FindStruct

Many clinical documents are not well structured in the sense of being based on a formal markup language, but nevertheless have significant clues within their text that allow a heuristic computer program to recover much of their structure and thereby to create a version with the formal markup included. The FindStruct program is designed to do just this task. This program, and others like it, work best when the clinical documents are created using templates that indicate the document structure as embedded labels, headings, etc., or when documents have been generated by computer programs that were designed to support readability by human readers, but not by other computer programs. Fortunately, we have encountered a significant number of such sources of documents, and the current program is the result of generalizing methods from a handful of these.

FindStruct takes an XML file that specifies the clues to be sought in a text document to mark the beginnings (and occasionally the patterns) of text sections and subsections. It then processes a set of text documents, searching for the specified structures in each, and writes an XML file with the same content as the original document but with explicit markup having been inserted to indicate the structures that have been found. Optionally, it also recognizes numbered lists of items and outputs XML tags to indicate these.

Running the Program

The program is distributed as a Java .jar file that you need to download, findstruct.jar, which includes the compiled version of the program as well as the JDOM library, which it uses. To run it, you need to have installed on your system a contemporary Java Runtime Environment (JRE, or part of a JDK). I have tested this program with J2SE versions 1.4.2 and 1.5.0. (If you need these, they are available from Sun's Java site). Assuming your environment is set up so that the java executable is on your path, you can simply invoke FindStruct as:

java -jar findstruct.jar -t template.xml file1 file2 ...
where for template.xml, you need an XML file specifying the structure of the input files and hints on how to recognize that structure (see below). Without the -t argument, that specification should be in the file "default.xml". Under most operating systems, if you use wildcards in the specification of an input file name, the OS will expand the list of input files to include all those that match. Invoking this jar file with no arguments prints a brief instruction.

The source code is also available for use under the MIT license.
Note: the original program assumes that a section name can occur only onces inside a given medical record. We are making available under the MIT license an updated version of the code, that allows for a section name to occur multiple times. See updated .jar file and source code.

Specification

Here is part of a specification file that describes the textual clues to be found in an Emergency Department note as generated by a template-based system at a Boston-area hospital:

<root suffix=":" case_insensitive="yes">
<report>
<seen
pattern="Patient seen:\s*(\d{1,2}):(\d{2})\s*([AP]M)?"
attrs="HR,MIN,AMPM"/>
<source
pattern="Source of\s+history:([^\.]*)\."/>
<allergies/>
<pmh>
<medical_history/>
<prior_hospitalizations/>
<surgical_history/>
<meds/>
</pmh>
<family_history/>
<social_history/>
<immunizations/>
<developmental_history/>
<ros>
<general/>
<urinary/>
</ros>
<current_medications/>
<pe>
<appearance>
<skin/>
</appearance>
<head/>
<ears/>
<eyes/>
<nose/> ...

The root element is used only to carry additional information about the overall parsing strategy. The first 6 options map directly to features of the java.util.regex.Pattern class that implements our search strategy. In every section name, any "_" characters turn into a pattern that searches for one or more whitespace characters, and optionally a prefix and suffix are added. The lists, listPat, listPrefix and listSuffix attributes tell whether to try to recognize lists within subsections and how their beginnings are marked.

Each of these attributes may also occur on any XML element defining a subsection; in that case, it locally overrides the defaults or the values specified via the root.

case_insensitive="yes"
If matching of (sub)section names is to be case-insensitive. Defaults to "no".
canon_eq="yes"
Enables canonical equivalence for unicode characters. Defaults to "yes".
comments="no"
Allows comments to appear in the search string.
dotall="no"
Allows the pattern "." to match any character, including new lines. Defaults to "no".
multiline="yes"
Supports multi-line patterns by allowing "^" and "$" to match line beginnings and ends within the file. Defaults to "yes".
unix_lines="no"
Makes only the "newline" character (\n) represent an end of line. Defaults to "no".
prefix=
A string to prepend to each (sub)section name before matching it. For example, if each section is indicated by starting on a new line, then "^" as the value of prefix would assure that a section name would only be recognized as such when it appears at the beginning of a new line.
suffix=
A string to append to each (sub)section name before matching it. In our example, each such name must be followed by a colon in order to be recognized in the text.
lists="no"
If yes, then once a section has been recognized, its text is searched for numbered lists of elements, which provide further substructure. Default is "no".
listPat=
The regexp pattern that identified possible instances of the marker identifying the start of a list element. Default is "\d+", i.e., a sequence of one or more digits.
listPrefix=
The regexp pattern that identifies a prefix that must occur before the listPat in order to have it recognized. Default is "\s|^", i.e., either a space or the beginning of a line (assuming multiline is yes).
listSuffix=
The regexp pattern that identifies a suffix that must occur after the listPat in order to have it recognized. Default is "[):\.]\s", i.e., a period, close-parenthesis or colon followed by a white space character.

The root element must have only a single component element. Its title (in our case "report") is not searched for, but becomes the root element name of the output. Starting with that element, each element may have zero or more sub-elements. Pure text in the specification file is ignored, but substructures are searched for by searching for each specified substructure of each element, in any order. When the spec shows a substructure (e.g., surgical_history) within a higher level structure (e.g., pmh), then the header for that substructure will be searched for only within the region of the document identified as belonging to the higher level structure.

In most documents, there is a recognizable marker only for the beginnings of document sections, not their ends. Therefore, we normally interpret the text between the marker for one element and another to belong to that first element (or one of its substructures). Exceptions to this occur under three circumstances:

1. If the first sub-element of an element does not start at the beginning of its text, then the text between the marker for the element and the marker for the first sub-element is placed in the element itself.

2. When the pattern= attribute is used to parse a (sub)element, the region matched has both a beginning and an end. In this case, text following the end of this subelement to the beginning of the next element is placed in the element itself. In addition, text matched by the pattern is extracted by the group() commands and becomes associated with successive subelements named by the comma-separated names in the attrs= attribute. If the attrs= attribute is not given, then the first group (not the text matching the entire pattern) becomes the content of the whole (sub)element.

3. If the last item in a numbered list (see below) ends before the entire element that the list is part of ends, then any remaining text becomes part of the element itself, following the list. We consider a blank line to end the last item, even though (inconsistently), blank lines may occur within earlier items of a list.

Finding Lists

In addition to the structure parsing performed as described above, if lists="yes" is given, then we also parse any remaining text elements to see if they can be interpreted as containing numbered lists. The numbers must be preceded by white space (which includes a new line), consist of only decimal digits, end in ".", ":" or ")" followed by white space, and appear in successive numerical order. The last element of the list ends at the next blank line or at the end of the enclosing element, whichever comes first. This imposes fairly rigid criteria for finding lists, but avoids most annoying false positives.

Any numbered lists found are delimited in the output by XML tags as such:

<list>
  <item number="1">...</item>
  ...
</list>

Examples

The following two files contain specifications for semi-structured documents from two Boston-area hospitals:

Emergency Department notes from a pediatric hospital. This example shows a relatively deeply-nested set of elements and sub-elements.

Discharge summaries from a tertiary-care medical center. This shows a flatter structure, and also relies on recognizing lists of numbered elements.

Note that in both examples, the listed categories do not necessarily form an elegant taxonomy. E.g., "left knee" vs. "elbow". This is because the people or programs that created these data were not themselves carefully systematic. It is often the case that specification files like this need to be empirically tuned to local data, and may need to be re-tuned as notational practice changes.

Because of sensitivity about confidentiality of patient data, I do not include any example files processed by the program using these specifications.


Peter Szolovits 12/14/2005