Chunk-based Faculty Biography Parsing and Automatic Resume Generation

From Dahuawiki

Jump to: navigation, search

Authors: Jingjing Liu (jingl@mit.edu) and Dahua Lin (dhlin@mit.edu)

Contents

Motivation

People search has been an increasingly popular topic in research literatures. Currently, most of the search engines are based on keyword matching, though important, which is clearly not sufficient. Grammar structures also play an important role in understanding the sentences that describe a person. Hence, it is desirable to go beyond keywords and explore the approaches that take the grammar structures into account. We believe that by incorporating the structural information within a larger context, the semantics can be better interpreted. In this project, we aim at solving a problem that is particularly related to the research community -- automatic parsing and extracting information from a faculty biography and thereon generating a brief resume.

Despite of the great efforts devoted to natural language processing in past decades, automatic parsing of human languages in a generic context remains a very difficult task. As we focus on a particular domain - faculty biography, we can utilize the domain-specific knowledge to help solving the problem. The people who author faculty biography tend to use long sentences with complicated grammar structure, which makes the parsing of the whole sentence extremely difficult. However, we observe that most important information in a biography, such as when and where the faculty member graduated, what degrees he or she has acquired, and what positions the person holds, are typically encapsulated in chunks within regular patterns. Rather than parsing the whole sentence, which is clearly beyond our need, we are going to explore the strategy of partial parsing and try to extract and interpret the chunks encapsulating the most important information. When the desired information is obtained, a brief resume can be readily generated by arranging the information properly with a template.

Overview

Principle: Context-specific Chunking

In this project, the biography is automatically collected from the faculty members' personal homepages. Given a biography, our approach generates the corresponding resume by extracting related information from it.

The key idea of our approach is to extract the chunks encapsulating important information, which we call chunking. Compared to parsing the entire sentence, chunking has the following advantages:

  1. While the sentences in biography is rather complex, the structure of chunks are typically simpler and more regular, thus they are easier to extract. In our work, we found that with a small set of well-designed regular expression rules, most of the patterns can be correctly detected.
  2. Most important information is embedded in several classes of chunks. If these chunks are successfully extracted and correctly interpreted, the biography is mostly understood. While the way how these chunks are connected together is relatively less important in terms of interpretation.

However, chunking in itself cannot solve the whole problem, since it is still a very difficult task in a generic context. We further apply the following strategies to overcome the difficulties.

  1. Keyword-based sentence classification We classify the sentences into different categories based on keywords (or more exactly, the combination of keywords). The sentences in different categories play different roles in describing a person, for example, some sentences represent the education experience, and some others represent the research interests. Though the chunk structures vary much for different sentences in a biography, they conform to relatively regular patterns within each category.
  2. Noun subcategorization Chunking is based on the pattern of part-of-speech tags. However, the tags that simply classify words into nouns, verbs, adjectives and so on do not carry sufficient information to support accurate and reliable chunk extraction. Actually, in the particular context that we are talking about, some nouns convey significant information to identify the chunks. For example, the word University is likely to be in a school name. We refine the categorization by tagging some of the words with special tags that represent sub-categories.
  3. Chunk pattern interpretation After the chunks are extracted, the interpretation is not only based on the chunks themselves but also on their order in the sentence. This is important especially when interpreting a complicated sentence expressing multiple things. Consider the following example
     He got his M.S. and Ph.D degrees from MIT, respectively in 1985 and 1988. 
    Consider that we successfully extracted the following chunks M.S., Ph.D, MIT, 1985, 1988, then when did he get master and when doctorate? To answer this question, we need to pay attention to the order of the chunks.

In sum, our approach relies on chunking for information extraction. The techniques of sentence classification, noun subcategorization, and pattern interpretation are further applied to establish a domain-specific context, such that chunking can be done easiler and more reliably.

Overall Procedure

Our approach generates a resume from the input biography text through six stages:

  1. sentence segmentation and word tokenization,
  2. keyword-based sentence classification,
  3. tagging with noun subcategorization,
  4. chunking,
  5. interpretation based on the extracted chunks,
  6. resume generation.

Here, we use a simple example to illustrate how the approach works. Consider the following example

 ...... ...... She received an A.B. in Applied Mathematics in 1983 and a Ph.D. in Computer 
 Science in 1990, both from Stanford University. ...... ......

The procedure goes as follows:

Step 1

The entire biography comprises multiple sentences, which are first segmented. The sentence above is one of the segmented sentences. The segmentation adopts a simple method based on punctuations. Special attention is paid to avoid cutting acronyms. The sentence is then further divided into sequence of words by tokenization.

Step 2

The words are assigned part-of-speech tags, like follows
She/PPS received/VBD an/AT A.B./NN-DEGN in/IN Applied/VBD Mathematics/NN-SUB 
in/IN 1983/CD and/CC a/AT Ph.D./NN-DEGN in/IN Computer/NN Science/NN-SUB in/IN 1990/CD, 
both/ABX from/FROM Stanford/NN University/NN-UNIV ./.
Note that some words that may may convey significant information, like A.B., Ph.D., Mathematics, Science, and University, are respectively tagged with special noun categories: NN-DEGN (degree name), NN-SUB (subject name), NN-UNIV (the word meaning a university).

Step 3

Based on the following keywords appearing in the sentence: A.B., Ph.D, University, we can assume that this sentence is talking about education and degrees. The keyword-based classifier will classify the sentence to the education type. Generally, we don't make decisions just based on single keyword, but typically on the combination of multiple keywords for the sake of reliability.

Step 4

The chunks are then extracted based on POS tags. Our approach will extract the following chunks from the sentence
DGR: A.B.
MGR: Applied Mathematics
YR:  1983
DGR: Ph.D
MGR: Computer Science
YR:  1990
SCH: Stanford University
Each chunk is assigned a chunk tag to represent the role played by that chunk. In this example, DGR, MGR, YR., and SCH respectively stand for degree, major, year, and school. The parser that we use to detect the chunks is RegexParser in NLTK, which works by matching the sequence of POS tags with regular expressions.

Step 5

After the chunks are extracted, we can interpret the embeded information based on the patterns of the chunk types. In the above example, the chunk pattern is
 <DGR><MGR><YR><DGR><MGR><YR><SCH> 
In a typical situation, this pattern can be explained as follows. The two <DGR><MGR><YR> sub patterns respectively represent two different degrees, which are both acquired from the school specified in the final <SCH> chunk. We note that in many cases, a single chunk is sufficient to express an event, while in some more complex cases, such as the above one, a group of multiple chunks together refer to an event. Under such circumstances, we also use chunking to detect the multi-chunk patterns. This stage of chunking is at a higher level based on chunk-tags, while the previous stage is based on POS-tags. The interpretation is internally represented in form of an array of structs as follows
 [1]: degree = A.B.    major = Applied Mathematics     school = Stanford University
 [2]: degree = Ph.D.   major = Computer Science        school = Stanford University.

Step 6

Finally, a resume can be generated based on the interpretation. In this example, we can generate a table of education experience, or a list of degrees based on the information acquired from the sentence.

Data Collection

We use the personal homepages of the professors in CSAIL at MIT as the dataset for our experiments. CSAIL maintains an index page of these personal pages at http://www.csail.mit.edu/biographies/PI/biolist.php. In this dataset, there are personal webpages for 95 faculty members. Since the structure of the webpages is relatively regular, we use a simple script to extract the bibliography automatically, which simply takes the paragraphs following the title Biography.

Here is the biography for Erik Demaine

Erik Demaine is an Esther and Harold E. Edgerton Professor and Associate Professor in the 
Department of Electrical Engeering and Computer Science, and a member of the Theory of 
Computation group in the Computer Science and Artificial Intelligence Laboratory, at the 
Massachusetts Institute of Technology. He joined the faculty in 2001, and became an Associate 
Professor in 2005. He received his PhD in 2001 and MMath in 1996 at University of Waterloo, 
and his BSc in 1995 at Dalhousie University.

His research interests span much of theoretical computer science and mathematics, in particular 
with connections to algorithms. Major research foci include discrete and computational geometry 
(particularly folding and unfolding of linkages, paper, polyhedra, and proteins), advanced data 
structures, graph algorithms, and recreational algorithms (such as the complexity of combinatorial 
games).

We can observe the characteristics of faculty biography that we have discussed above, which typically consists of long sentences, and important information is mostly encapsulated in chunks. Though there are also other information available in the webpage, our project aims at parsing the biography text.

Design of Techniques

In this section, we introduce the design of the techniques in detail.

Sentence Segmentation and Word Tokenization

The first step to manipulate a biography is to divide the paragraphs into sentences, which are the unit of interpretation. We have observed that in typical biography, most sentences are enough to support interpretation independently. Hence, in this project, we are not going to explore the situation that require combining multiple sentences to reach a reasonable interpretation.

The paragraphs are divided based on puntuation. We locate the full points (.) and consider them as candidate positions to end a sentence. However, there are cases that the stop points are placed in names and acronyms, like Ph.D. and G. W. Bush. To avoid cutting through these phrases, we only cut sentences where the stop points follow non-capitalized characters or digits and are followed by spaces and then capitalized characters. Though simple, this rule works well for most biography that we processed.

For each sentence, we further convert it into a sequence of words by tokenization. The most natural way to accomplish this is to split the sentence at spaces. This does not work well since there is generally no spaces between a word and the punctuation that follows it. For example, the sentence "He is a professor." will be divided into four words: He, is, a, professor., in which the last word is a noun plus a stop point. The mixture of punctuations and words in tokens will cause a lot of confusion in further analysis. Another approach to tokenization is to cut at punctuations in addition to spaces, however, this method will divide an integral abbreviation into parts. For example, it will divide Ph.D into two tokens Ph and D, which is clearly undesirable.

In this project, we use regular expression based tokenizer (implemented in nltk.tokenize.regexp_tokenize), which extracts all tokens that match the specified regular expressions in order from left to right. In our design, we use the following regular expressions:

    Ph.D(\.)?|Sc.D(\.)?|B.Sc(\.)?|M.Eng(\.)?|M.Phil(\.)?|B.Tech(\.)?|Trans(\.)|Inc(\.)|Corp(\.)|Dept(\.)|
    "|([A-Z]\.)+([A-Z])(\.)?|[-'\w\d]+|[^\w\s]+

We can see that the regular expression is composed of multiple patterns, which are combined with "or". It means that a string that matches either of these patterns will be extracted as a token. In this expression, the patterns given in the first line are to match common acronyms that often appear in a faculty biography.The patterns in the second row are explained as follows:

  • ([A-Z]\.)+([A-Z])(\.)? is to match all acronyms composed of capitalized characters connected by dots, like M.I.T. or M.I.T
  • [-'\w\d]+ is to match typical words composed of letters, digits as well as apostrophe.
  • [^\w\s]+ is to extract punctuations.

Our tokenizer is implemented in the function bioparse.tokenize in file bioparse.py.

The tokenizer described above works pretty well in most cases. However, it may fail in some situation, e.g. the abbreviations containing non-capitalized letters (but not listed in the first row of patterns), such as M.Math (master of mathematics, it turns up for a couple of times in the biographies that we processed).

Keyword-based Sentence Classification

In many applications related to language understanding, keywords play a crucial role. The search engines that are heavily based on keyword matching, like Google, are the most successful applications. Though the approaches that merely rely on keywords have severe limitations, keyword matching's effectiveness has been proved by innumerable practical systems. In this project, we aim at developing an approach beyond keyword matching, however, we still make use of the information provided by keywords to help sentence parsing and interpretation.

We have discussed above that chunking is a very difficult problem in generic context. To address this problem, we develop chunking rules for different context respectively. Concretely, we first use keywords to classify a sentence to a particular category, and then invoke the corresponding chunking rules to extract the chunks.

According to our observation, important information is often carried by the sentences in the following categories (types):

type description keywords
education the education experience, i.e. what degrees he/she has received, in which major, which school, and when received, degree, B.S., M.S., Ph.D, Master, Doctor, from, University, ...
research the research interests, i.e. the research areas/topics that he/she is interested in research interests, focus, works on, area of research, interests include, ...
award the awards or honors that he/she has received recipient, winner, award, medal, prize, Distinguished, Outstanding, ...
position_univ the positions with in universities/schools, i.e. professor or scientist. Assistant Professor, Associate Professor, Lecturer, Department, Computer Science, affiliated with, ...
position_acad the positions in other academic organizations, i.e. editorial boards of journals, program committees of conferences, and academic associations, etc. chair, association, conference, journal, program committee, Fellow, IEEE, ACM, ...
position_org the positions in non-academic organizations such as companies. Company, Corp., Inc., officer, advisor, consultant, ...

Some examples of these types are given in the following table

Type Examples
education
He received his PhD in 2001 and MMath in 1996 at University 
of Waterloo, and his BSc in 1995 at Dalhouse University.

Brian Williams received his S.B., S.M. and Ph.D from MIT in 
Computer Science and Electrical Engineering in 1989.
research
His research interests include Long-term evolution of the orbits 
and spins of the planets and natural satellites; qualitative 
behavior of dynamical systems; chaotic behavior; and dynamics 
of planetary rings.
award
He was the recipient of the 1976 Shannon Award of the IEEE 
Information Theory, and of the 1977 Education Medal of the IEEE.
position_univ
Chris Terman is a Senior Lecturer in the EECS Dept. and a 
member of the Computer Architecture Group in CSAIL.

Saman Amarasinghe is currently an Associate Professor in the 
Department of Electrical Engineering and Computer Science 
at Massachusetts Institute of Technology and is the co-leader 
of the MIT Raw Architecture group.
position_acad
He has chaired and served on the program committee of many 
meetings sponsored by ACM and IEEE.

Victor is a Fellow of the Acoustical Society of America, 
and US. National Academy of Engineering.
position_org
He also served as the Vice President of Technology at Symbolics 
Inc.

He is also co-founder and Chief Technical Officer of iRobot 
Corp.



To classify sentences reliably, we usually rely on combination of keywords to judge the type of a sentence, rather than merely on individual keywords. Each sentence type is characterized by a set of rules, each rule comprises multiple keywords or key phrases. When a sentence contains all the keywords/key phrases of a rule, it is considered to match that rule, and a sentence is classified to a particular type when it matches either of the rules characterizing the type.

A full list of keyword combination rules are given in the submitted package: code/keywords.txt. Actually, the sentence classifier is constructed by reading that file. The sentence classification is implemented in the kwclass module in kwclass.py file.

Part-of-Speech Tagging

Considering the complexity of the problem that we are facing, we use multiple taggers to tag the classified sentences. The taggers are listed in order of priority from high to low as follows:

1. domain-specific tagger. A look-up tagger based on a specific keyword-specification file (code/tagclass.txt in the submitted package). The tagger locates the nouns listed in the keyword-tagging file and tags them with the corresponding tag name. By tagging important nouns that convey important information for chunking and interpretation with special tags, we implement noun subcategorization. We will see later that these sub category tags play a crucial role to help chunking. The following table shows several important sub-categories of nouns (please refer to the tagclass.txt file mentioned above for the full list):

tag representative words
NN-UNIV University, Institute, College, ...
NN-DEGREE Degree, Degrees
NN-DEGN-A M.S., Ph.D., M.Eng., M.Phil., B.S., B.Sc., ...
NN-AWARD Award, Medal, Prize, Fellowship, Scholarship, ...
NN-PROF Professor, Lecturer, Prof.
NN-HEAD Director, Head, Leader, Chairman, Founder, ...
NN-AORG Committee, Board, Association, Academy, ...
NN-SUB Mathematics, Physics, Chemistry, Science, Engineering, ...
... ...

2. Bigram tagger. A tagger that tags a word based on the word that precedes it. The tagger is trained on the brown dataset in NLTK.

3. Unigram tagger. A tagger that tags a word according to the frequencies of the tags for the word in the training set, which is also trained on the brown dataset in NLTK.

4. Regexp tagger. A regular-expression-based tagger that tags the works like ‘—ed’ to be ‘VBD’ and the words composed of digits as ‘CD’.

5. Default tagger. The last backoff tagger that tags all remaining words to be default NN.

Each tagger is the backoff of the tagger with higher priority. Namely, the tagging process first applies the first tagger. If there are untagged words left, it will apply the second one as backoff. If there’s still untagged words remained, the third tagger will be applied. This process continues until the fifth default one is applied, which tags all remaining unknown words to NN.

Chunking and Interpretation

Given the classified and tagged sentences, the next step is parsing, which is the most important step in our approach. We use chunk-parser in this project, as what we are interested is partial information within the chunks rather than the grammatical structure of the entire sentences. For each class of sentence, we build a set of chunking grammars for detecting the chunks respectively. The design of these grammars will be detailed in the following.

Moreover, since the interpretation of a sentence is closely related to the chunks, we will discuss interpretation of the chunks in this section as well, which is more convenient than talking about interpretation in a separated section.

Education Experience

The sentences on education experience are classified to the type education. From the example in the table above, we see that sentences in this type typically involves information like what degrees the person has received, what are the majors, as well as where and when he/she got the degrees. From these sentences, we intend to derive the following five types of chunks.

1. Person Name

 PNNM: {<NP>?<NN>+<.>?<NN>?<VBD|VBZ|BEDZ>}
       }<VBD|VBZ|BEDZ>{

The person name typically consists of a sequence of nouns with tags <NP> or <NN> in the following pattern <NP>?<NN>+<.>?<NN>?. Representative examples are "Fernando/NN Corbato/NN", and "R/NN ./. Rivest/NN". In a typical narrative sentence, the person name often serves subjective and is followed by verbs or auxiliary verbs that are tagged with VBD, VBZ, or BEDZ. Therefore, we can use <NP>?<NN>+<.>?<NN>?<VBD|VBZ|BEDZ> to extract the subjective noun phrases in combination of verbs, and use <VBD|VBZ|BEDZ> to exclude the verb part.

2. Degree Name

 DGR: {<NN-DEGN-A>}
      {<NN-DEGN-L>}

It is relatively straightforward to extract degree names, as they are typically single tokens tagged with sub category for degree names: NN-DEGN-A (e.g. Ph.D, M.S., B.Sc., etc) and NN-DEGN-L (e.g. Doctor, Master, Bachelor, etc).

3. The Name of School (University)

 SCH: {<NN-UNIV><IN><NN|NP><IN|,><NN-UNIVN>}
      {<NP|NN><NN-UNIV><IN><NN.*>}
      {<NP|NN><NN-UNIV><IN><JJ-TL><NN|NP>}
      {<NN-UNIVN><NN-UNIV>?}
      {<NP|NN>+<NN-TL>?<NN-UNIV>} 
      {<NN-UNIV><IN><NN.*|NP.*>}
      {<NN-TL><IN-TL><NN><IN><NN>}
          
      {<FROM><AT>?<NN>+}
      }<FROM>{

The names of universities are generally quite regular, which mostly follow the patterns listed in the following table

Pattern Examples
<NN-UNIV><IN><NN|NP><IN|,><NN-UNIVN>
University of California, Berkeley
<NP|NN><NN-UNIV><IN><NN.*>
Massachusetts Institute of Technology
<NP|NN><NN-UNIV><IN><JJ-TL><NN|NP>
Flinders University of South Australia
<NN-UNIVN><NN-UNIV>?
Stanford University, Harvard
<NP|NN>+<NN-TL>?<NN-UNIV>
Carnegie Mellon University, Brown University
<NN-UNIV><IN><NN.*|NP.*>
University of Michigan

We can see that the above 6 rules all rely on the existence of the keywords in the sub-category NN-UNIV (e.g. University, Institute) to identify the university name entities. Though these rules work for most schools, there remain some situations that they fail to find the school names. Some universities outside USA just don't use these keywords to represent universities, such as Eidgenssische Technische Hochschule, which is a famous school (ETH) in Europe. One way to work around this is to add special rules for them, but this way doesn't generalize well. To better account for this, we make use of grammatical structure.

We find that the phrases that look like xxx of xxx in/of xxx is likely to be a school name. This rule is surely not correct in a generic context, however, with in a sentence describing education, this is probably the case. Here, we can see how a restricted context established by sentence classification helps chunking. Based on this, we derive the rule {<NN-TL><IN-TL><NN><IN><NN>}.

In addition, we find that in an education-typed sentence, the noun phrases following the word "from" are often the school name. (Yes, this is true in such a particular context.) So we finally design the rule {<FROM><AT>?<NN>+} to extract from + noun phrase patterns and exclude the word from by the rule }<FROM>{. Owning to this rule, it can now detect Eidgenssische Technische Hochschule even when the model can recognize none of these three words (it uses the default tagger to tag all of them as NN).

4. The Name of Major (Subject Name)

 MJR: {(<NN.*>|<VBD>)?<NN-SUB><CC>(<NN.*>|<VBD>)?<NN-SUB>}
      {(<NN.*>|<VBD>)?<NN-SUB>}
     
      {<IN><JJ|VBN>?<NN>+<NN-TL>?<FROM><AT>?<SCH>}
      {<DGR><IN><JJ|VBN>?<NN>+<NN-TL>?}
      {<DGR><IN><NN>+<CC><NN><NN-TL>}
     
      {<IN><JJ>?<NN><IN|,>}
      {<IN><NN>+<CC><NN><NN-TL>?<FROM><AT>?<SCH>}

      {<DGR>?<IN><NN-TL><NN>}
      {<DGR>?<IN><NN><NN-TL>}
      {<IN><NN><NN-TL><FROM><AT>?<SCH>}
     
      }<DGR>?<IN>{
      }<FROM><AT>?<SCH>{

We use the following table to show how these rules match the names of majors.

Pattern Examples
(<NN.*>|<VBD>)?<NN-SUB><CC>(<NN.*>|<VBD>)?<NN-SUB>
Electrical Engineering and Computer Science
(<NN.*>|<VBD>)?<NN-SUB>
Computer Science, Applied Mathematics, Mathematics

These two rules utilize the central words in NN-SUB category (e.g. Engineering, Science, Mathematics) to locate the names of majors. While the remaining rules take advantage of the positional relation with school names and degree names to locate the majors. Since major names are typically more irregular, and it is possible to have unknown words. So the assistance from context is helpful. After finding the phrases in combination with contextual words, we use the rules }<DGR><IN>{ and }<FROM><AT>?<SCH>{ to exclude the surrounding parts.

5. Year

 YR:    {<CD>}

Locating the years are simple, just find the numbers in the sentences. We find that in all the education-typed sentences that we have ever seen, the only role that a number can play is to represent a year.

Interpretating education experience

Each education-typed sentence is interpreted as a list of degrees, with each degree represented by four fields, including the degree name, major, school, and year.

It is often the case that multiple degrees are tangled together in a sentence, which impedes correct interpretation. To address this, we need to analyze the ordering of the detected chunks. We summarize the typical chunk-order patterns as follows

PAIR1:            {<DGR><MJR>?<DGR><MJR>?<YR><YR><SCH>}
                  {<DGR><MJR>?<DGR><MJR>?<SCH><YR><YR>}	

PARALLEL:         {(<DGR><MJR><YR>*)(<DGR><MJR><YR>*)+<SCH><YR>?}
PARALLEL2:        {(<DGR><YR>*)(<DGR><YR>*)+<MJR><SCH><YR>?}
		  {(<DGR><YR>*)(<DGR><YR>*)+<SCH><MJR><YR>?}
		  {(<DGR><YR>*)(<DGR><YR>*)+<SCH><YR>?}

SINGLE:		  {<DGR><MJR>?<SCH><YR>+}
		  {<DGR><MJR>?<YR>+<SCH>}
	          {<DGR><MJR>?<SCH>}			

PAIR1-NS:         {<DGR><MJR>?<DGR><MJR>?<YR><YR>}
		  {<DGR><MJR>?<DGR><MJR>?<YR><YR>}	

PARALLEL-NS:	  {(<DGR><MJR><YR>*)(<DGR><MJR><YR>*)+<YR>?}
PARALLEL2-NS:	  {(<DGR><YR>*)(<DGR><YR>*)+<MJR><YR>?}
	          {(<DGR><YR>*)(<DGR><YR>*)+<MJR><YR>?}
		  {(<DGR><YR>*)(<DGR><YR>*)+<YR>?}

SINGLE-NS:        {<DGR><MJR>?<YR>+}
                  {<DGR><MJR>?<YR>+}				
                  {<DGR><MJR>?}

We again apply chunk detector to the detected chunks to figure out their ordering patterns, and take corresponding actions to interpret the pattern. For example <DGR><MGR><YR><DGR><MGR><YR><SCH> typically mean two degrees obtained in the same school but with different majors and years. We carefully analyze the example sentences to decide how to interpret each of these patterns. Please refer to biointerp.py for details.

Research Interests

The sentences to express research interests are not that regular as the education-typed sentences. People use various ways to tell a research topic. The only commonality that we found among the phrases representing research interests is that they are all noun phrases. And it is true that in a research-interest-typed sentence, most of the noun phrases therein is about a research area or research topic. However, extracting all noun-phrases is problematic. Let's look at the following examples

 His research interests include Long-term evolution of the orbits and spins of the 
 planets and natural satellites, qualitative behavior of dynamical systems, chaotic 
 behavior, dynamics of planetary rings.

 Prof. Kaebling has done substantial research on designing situated agents, mobile 
 robotics, reinforcement learning, and decision-theoretical planning. 

Among all the noun phrases appearing in these two sentences, three of them are not names of topics or areas: research interests, Prof. Kaebling, and substantial research, which are in two types: the phrases just bearing the meaning of research interests, and the phrases representing the professor. Exceptions in all research-typed sentences mainly fall in these two classes.

So the strategy that we take is to extract all noun phrases and exclude those in these two types. Since the word research is tagged as NN-RSCH and the words Professor and Prof. are tagged as NN-PROF in the tagging stage with sub-categorization. We can easily locate these phrases to preclude them from the detection results.

The grammar rules are given as below

 PEOPLE:   {<NN-PROF><.>?(<NN.*>)*}

 RSCH-H:   {(<NN.*|JJ.*>*)<NN-RSCH|NN-FOCUS|NN-FIELD|NN-GRP|NN-ACT>}

 RSCH:	   {(<AT|NN.*|JJ.*|VBD|VBG>(<IN|IN-TL|TO|AT|NN.*|JJ.*|VBD|VBG|CC>*))?<NN.*|VBG>}
           }<NN-PROF>{
           }<VBD><IN|IN-TL|TO|FROM>{

Here, PEOPLE is to match those phrases like Professor David, while RSCH-H is to match those like research interests, research fields, etc. Since these two rules are applied first, only the phrases that can not be captured by them will be matched by RSCH, which can be considered as to match all noun phases. After the chunks in these types are extraced, we only retain the RSCH chunks.

Then the sentence is interpreted as a list of research topics with each topic corresponding to a RSCH chunk.

Awards and Honors

The chunking grammars for awards are given as follows.

 AWARD:   {<N.*|JJ.*|RB|VBG|CD>*(<CC><N.*|JJ.*|RB|VBG|CD>)*<NN-AWARD|NN-AWARDS>
           <NN.*>*(<IN|IN-TL|VBG><AT>?(<N.*|JJ.*|RB|VBG|CD>)*<N.*>)+}         
          {<N.*|JJ.*|RB|VBG|CD>+(<CC><N.*|JJ.*|RB|VBG|CD>)*<NN-AWARD>}

The extraction of awards is relatively easy, since a phrase representing an award typically contain a central world in the sub category NN-AWARD(e.g. Award, Medal, Prize, etc). Any noun phrases with NN-AWARD, no matter how many other descriptive words preceding or following it, is likely to represent an award, especially in the context that the sentence is award-typed. Here are just several examples:

  Prof. Rinard received a National Science Foundation Early Career Development 
  Award in 1997.

  He was the recipient of the 1976 Shannon Award of the IEEE Information Theory, 
  and of the 1977 Education Medal of the IEEE.

Awards are typically with a central word in special sub category, however, it is not always the case for honors. Let's look at the following example:

  Madhu Sudan was named a Distinguished Alum by the CS Division of the University 
  of Carlifornia at Berkeley in 2003 and by the Indian Institute of Technology at 
  Delhi in 2004.

  Prof. Leighton's technology achievements at Akamai earned him recognition as one 
  of the Top 10 Technology Innovators in U.S. News.

In these sentences, the words that characterize the honor are not nouns, but some adjectives like Distinguished, which express the excellence, and are tagged as JJ-EXCEL in the tagging stage. Another way to locate an honor is to find the patterns like one of (<CD><IN>), such a pattern does not necessarily refers to an honor in generic circumstances. But in a sentence classified as award-typed, this is probable. Based on this analysis, we derive the rules for extracting honors as below:

 HONOR:   {<JJ-EXCEL><NN.*>+(<IN|IN-TL><AT>?(<N.*|JJ.*|RB|VBG|CD>)*<N.*>)+}
          {<CD>(<IN|IN-TL|VBG><AT>?(<N.*|JJ.*|RB|VBG|CD>)*<N.*>)+}

With the award and honor chunks detected, the interpretation of the sentence is given in form of a list of awards and honors.

Positions in Universities

The following is a typical sentence describing a position in university.

  He is currently a professor of Electrical Engineering and Computer Science at MIT.

This sentence expresses a position with the following three aspects of information: the position is professor, the department is Electrical Engineering and Computer Science, and the school is MIT. For each position, we hope extract information in these three aspects (at least some of them). We thus design three sets of grammars respectively for them.

1. Position

The grammar to extract the name of position is given below

  POSITION:   {(<NN|NN-RSCH>*)<NN-PROF|NN-HEAD|NN-POS|NN-MEMBER>}

We can see that the rule is driven by the central word in sub categories NN-PROF(e.g. Professor, Lecturer), NN-HEAD(e.g. Director, Chair, Dean), NN-POS(e.g. Scientist, Researcher), NN-MEMBER(e.g. Member). The rule also allows a some descriptive words precedinig it, in order to match the phrase like Principal Research Scientist and Assistant Professor.

2. DEPT

The grammar to extract the departments is given as follows

 DEPT:       {<NN-DEPT><IN>(<NN|NN-SUB>+)(<CC>(<NN|NN-SUB>)+)}
             {(<NN.*>+(<CC><NN.*>+)*)<NN-DEPT|NN-GRP|NN-LAB>}
             {<NN-LAB-N>}

             {(<NN.*>*)<NN-SUB><CC>(<NN.*>*)<NN-SUB>}
             {(<NN.*>*)<NN-SUB>}

We use a table with examples to show how these rules work.

Pattern Examples
<NN-DEPT><IN>(<NN|NN-SUB>+)(<CC>(<NN|NN-SUB>)+)
Department of Computer Science and Electrical Engineering
(<NN.*>+(<CC><NN.*>+)*)<NN-DEPT|NN-GRP|NN-LAB>
Computer Science and Electrical Engineering Department
<NN-LAB-N>
CSAIL
(<NN.*>*)<NN-SUB><CC>(<NN.*>*)<NN-SUB>
Computer Science and Electrical Engineering
(<NN.*>*)<NN-SUB>
Applied Mathematics, Mathematics

The first two rules are to match the phrases with the central word in NN-DEPT(e.g. Department), while the third rule is to match particular acronyms, and the remaining two rules are to match the phrases without the word Department or Dept., which are then detected based on the words in NN-SUB(e.g. Mathematics, Science)

3. SCH

SCH:       {<NN.*|NP>+<NN-UNIV>(<IN>(<NN.*|NP>+))*}
           {<NN-UNIV><IN>(<NN.*>+)}
           {<NN-UNIVN><NN-UNIV>?}

The extraction of school names are mainly hinged on the central words in <NN-UNIV> , just as how we extract schools for education-typed sentences as discussed above.

With the above chunks extracted, the interpretation can be obtained in form of a list of positions.

Positions in Academic Organizations

In addition to being professors in universities, the faculty members often have other academic positions, such as journal editors, fellows of academic associations, etc. We also intend to extract these positions from the biography.

According to our analysis, we can typically extract the following four types of chunks: academic positions, academic organizations, the names of journals, and conferences. The grammar rules are given in the following:

 ACADPOS:	{<AT>?(<NN.*|JJ.*|VBG|VBD|CD>)*<NN-HEAD|NN-MEMBER|NN-EDITOR|NN-FELLOW>
                 (<IN|IN-TL|VBG>(<AT|AT-TL|AP|NN.*|JJ.*|VBG|VBD|CD|>)+((<CC>|<,>)?
                 (<AT|AP|NN.*|JJ.*|VBG|VBD|CD|>)+)*)+}
				
                {<AT>?(<NN.*|JJ.*|VBG|VBD|CD>)+<NN-HEAD|NN-MEMBER|NN-EDITOR|NN-FELLOW>}

 ACADORG:        {((<AT|NN.*|JJ.*|VBG|VBD|CD>)*)<NN-AORG|NN-AORGS>
                 (<IN|IN-TL|VBG>(<AT|AT-TL|AP|NN.*|JJ.*|VBG|VBD|CD|>)+
                 ((<CC>|<,>)?(<AT|AP|NN.*|JJ.*|VBG|VBD|CD|>)+)*)*}
		 {<AT>?<NN-ORGN>(<AT|NN.*|IN|IN-TL>)*}

 JOURNAL:        {<AT>?(<NN.*|JJ.*|VBG|VBD|CD>)*<NN-JNL>
                 (<IN|IN-TL|VBG>(<AT|AT-TL|AP|NN.*|JJ.*|VBG|VBD|CD|>)+
                 ((<CC>|<,>)?(<AT|AP|NN.*|JJ.*|VBG|VBD|CD|>)+)*)+}

		{<AT>?(<NN.*|JJ.*|VBG|VBD|CD>+)<NN-JNL>}
				
 CONFERENCE:     {<AT>?(<NN.*|JJ.*|VBG|VBD|CD>)*<NN-CFN>
                 (<IN|IN-TL|VBG>(<AT|AT-TL|AP|NN.*|JJ.*|VBG|VBD|CD|>)+
                 ((<CC>|<,>)?(<AT|AP|NN.*|JJ.*|VBG|VBD|CD|>)+)*)+}

	        {<AT>?(<NN.*|JJ.*|VBG|VBD|CD>+)<NN-CFN>}

We can see that all these rules are driven by central words, respectively in the suib categories: <NN-HEAD|NN-MEMBER|NN-EDITOR|NN-FELLOW>(e.g. chair, member, editor, Fellow), <NN-AORG|NN-AORGS>(e.g. committee, board, association), <NN-JNL> (e.g. Journal, Transaction), and <NN-CFN>(e.g. Conference).

While the preceding pattern <AT>?(<NN.*|JJ.*|VBG|VBD|CD>)* is to allow the matching of preceding descriptive words, while the succeeding pattern (<IN|IN-TL|VBG>(<AT|AT-TL|AP|NN.*|JJ.*|VBG|VBD|CD|>)+ is to match the prepositional phrases that follow the central world, and ((<CC>|<,>)?(<AT|AP|NN.*|JJ.*|VBG|VBD|CD|>)+)*)+ is to allow the word "and" appear in the prepositional phrases.

The following table gives several examples of these chunks:

Chunk Examples
ACADPOS
a member of the American Academy of Arts and Sciences,
an associate editor of Genetic Programming and Evolvable Machines
ACADORG
the editorial board of Evolutionary Computation, 
National Academy of Engineering,
ACM, IEEE
JOURNAL
The Journal of Functional Programming,
IEEE Trans. on Pattern Analysis and Machine Intelligence
CONFERENCE
ACM Symposium on Theory of Computing,
Cognitive Science conference

Positions in Other Organizations

We finally try to extract the positions at other organizations, such as commercial companies.

The grammars are given as follows

  POSITION:    {<AT>?(<NN.*|NP|JJ.*|CD|VBG|VBD>)*(<NN-HEAD|NN-POS|NN-MEMBER>)
                (<IN|IN-TL|VBG>(<AT|AT-TL|AP|NN.*|NP|JJ.*|VBG|VBD|VBN.*|CD|>)+
                ((<CC>|<,>)?(<AT|AP|NN.*|JJ.*|VBG|VBD|CD|>)+)*)+}
			
               {<AT>?(<NN.*|NP|JJ.*|CD|VBG|VBD>)+(<NN-HEAD|NN-POS|NN-MEMBER>)}
			
  ORG:	       {(<AT|NN.*|NP|JJ.*|VBG|VBD|CD>)*<NN-AORG|NN-AORGS|NN-COPR>
                (<IN|IN-TL|VBG>(<AT|AT-TL|AP|NN.*|NP|JJ.*|VBG|VBD|VBN.*|CD|>)+
                ((<CC>|<,>)?(<AT|AP|NN.*|JJ.*|VBG|VBD|CD|>)+)*)+}

	       {<AT>?(<NN.*|NP|JJ.*|VBG|VBD|CD>)+<NN-AORG|NN-AORGS|NN-COPR>}
	        {<AT>?<NN-CORP-N><NN.*|NP>*}

Two types of chunks are extracted, respectively represent positions and organizations. As in position_acad, the rules are mainly central word driven, with preceding and succeeding patterns to help extending the matches to surrounding descriptive words.

The results are interpreted as a list of positions and organizations.

Summary of Chunking Grammar Design

We summarize several key aspects that we need to pay attention to in designing chunking rules:

  • There are two ways in designing the rule, based on central words or based on surrounding context. Generally, the former is more reliable. However, central words are not always available, and context-based rules are necessary in such situation.
  • The central words typically play the role as a pivot in locating chunks, however, information is not in the central words, but rather in the descriptive words. For example, in the phrase Journal of Functional Programming, the central word Journal in itself does not convey much information, what we want to know the what journal it is. Hence, it is necessary to endow the rules with the capabilities to extend the matches from the central words to surrounding descriptions. The grammatical structure of surrounding descriptive parts is thus important in designing the rules.
  • The order of rules are important. The parser apply the rules one by one in order. If a part of the whole phrase is matched by a simpler rule that comes earlier, then the entire phrase will not be matched again. A rule of thumb is to place long and complicated rules at first and let them to match chunks as long as possible, and use simpler rules as backoff to detect the remaining short chunks.
  • When the sentence is tangling multiple things, such as those in education-typed sentences, it is desirable to apply the chunking strategy based on chunk tags to detect the patterns in the chunk sequence, which helps to clarify the interpretation.

Resume Generation

When the information is all extracted and stored in the internal data structures as described above, we can readily generate the resumes. The procedure goes as follows, the information is first output to a xml file, which is just for keeping information and is display independent. Then a XSL (XML stylesheet) is applied to convert the XML file to a resume in form of HTML file.

For the biographies of CSAIL faculty, their corresponding XML output files and HTML output files (resumes) are respectively placed in results/xmls and results/htmls directories in the submitted package. It is worth noting that, they are just brief resumes summarizing extracted information, rather than full functional resumes. Through these resumes, we intend to demonstrate what we have extracted from the biography text.

Implementation

We implemented the approach with python 2.5. Some of the functionalities are implemented based on the Natural Language Toolkit (NLTK). Concretely, we use NLTK's regular expression based tokenizer and RegexpParser for chunk parsing. However, the rules are designed by ourselves. We also implement other aspects othe approach,including extraction of biography from the CSAIL webpages, sentence segmentation, keyword based classification, domain-specific tagging, chunk-based interpretation, as well as resume generation.

All codes are placed in the code directory in the submitted package, which include the following modules:

  • gtext: extraction of biography from CSAIL webpages, file I/O for biography files
  • bioparse: for word tokenization, POS tagging, and chunk detection
  • kwclass: keyword-based sentence classification
  • biointerp: interpretation based on extracted chunks
  • bioanalyze: whole procedure integration, and result organization
  • sentproc: per-type chunking result output
  • bioproc: the main script that takes biography as input and outputs resumes and result traces

We have also written a series of scripts for different batch processing tasks. In total, there are 15 python source files with about 2500 source lines.

All chunking grammars are in the data/chunk directory of the submitted package, and all CSAIL biographies extracted from webpages are in data/bio_txt. The results for these biographies are in the results directory, which contains not only generated resumes, but also interpretation results in XML format, as well as traces of intermediate results. All these results can be linked from the resume pages.

Results

Resumes

We generate resumes for all CSAIL biographies. Please enter the following index page to browse the generated resumes. Index Page of the Results

Each resume page maintains the links to the previous resume and the next resume for the convenience of navigation. In additon, they also maintain links to intermediate results in both HTML form (more friendly) and plain text form (containing more information).

Generally speaking, the interpretation results are acceptable. However, errors are unavoidable. Since the chunk parser is based on regular expression, it may not be generalized well to unseen structures. We have tried our best improve the design, which currently can work well for the biographies with similar styles to those of CSAIL faculty member.

Per-type Results

We also organize a set of sentences for each type, and generate parsed results for these sets, which can be browsed through the index page at Index Page of Per-type Results

From the results, we can see that our approach works pretty well in extracting the meaningful chunks from biography. The chunks in a variety of structures can now be detected with a set of well-designed rules.