
6.872/HST950 Problem Set 3

Handed out: Thursday, April 1, 2004

Due: Thursday, April 8, 2004

In an earlier lecture, we outlined a “theory of record linkage” (the full paper is linked from our class schedule page) that tells us, in principle, how to do probabilistic matching of various features of two objects in order to decide whether they are likely to be the same object. Briefly, the theory is as follows. I have interspersed questions for you to answer throughout the description.

1. Given two purported objects (e.g., patients), $o_1$ and $o_2$, it is either the case that $o_1 = o_2$ or that they are distinct individuals. For example, our records contain a patient file for Raul P. Szolovits of 123 Main Street, Boston, MA 02131; a new patient arrives claiming to be Peter Szolovits of 123 Main Street, Boston, MA 02113.

2. Among all the observations we might make of $o_1$ and $o_2$, we select a certain set of features $f_i(o)$ that we agree will be of interest. For example, we might choose last name, first and middle names, street address, city, and ZIP code.

3. For each pair of features $f_i(o_1)$, $f_i(o_2)$, we can compare the probability that one would observe $f_i(o_1)$, $f_i(o_2)$ in either of the two cases of step 1. For example, assuming that half the hospital’s patient population have home addresses in Boston, then $P(f_{city}(\text{Raul}), f_{city}(\text{Peter}) \mid \neg\text{same}) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$. By contrast, if these two records belong to the same person, then we would just expect that the probability that person lives in Boston is $\frac{1}{2}$. Thus, the likelihood ratio is $\frac{P(f_{city}(\text{Raul}), f_{city}(\text{Peter}) \mid \text{same})}{P(f_{city}(\text{Raul}), f_{city}(\text{Peter}) \mid \neg\text{same})} = \frac{1/2}{1/4} = 2$. Further, if 1% of people in the city live on Main St, then the corresponding likelihood ratio is $0.01 / 0.01^2 = 100$. We may get an additional likelihood ratio of 1000 (say) for the address, 123, and another factor of, say, 1.5, for both states being MA. These are both estimates, and answer the question of what fraction of all addresses is 123, or what fraction of individuals live in MA. If our initial database contains records on 1M individuals, then we might argue that the a priori odds are essentially $1/10^6$. If we assume conditional independence of each of the feature pairs from each other (e.g., if we believe that you are no more likely to get matching street numbers on Main Street than on Sunset Boulevard), then the posterior estimate is the product of the prior odds and the individual likelihood ratios: $10^{-6} \times 2 \times 100 \times 1000 \times 1.5 = 0.3$, i.e., posterior odds of roughly 0.3 to 1 that the two records describe the same person.
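The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function and variable names are made up for illustration, the prior odds of 1 in 1,000,000 are an assumption read off the 1M-record database, and the likelihood ratios are the illustrative values used in the text, not measured quantities.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Prior odds times the product of per-feature likelihood ratios.

    Valid only under the conditional-independence assumption in the text.
    """
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds (same : not same) into a probability."""
    return odds / (1.0 + odds)


# Assumed prior: one matching record among roughly 1M individuals.
prior = 1 / 1_000_000

# Illustrative likelihood ratios, P(match | same) / P(match | not same).
lrs = {
    "city": 0.5 / (0.5 * 0.5),              # half of patients live in Boston
    "street_name": 0.01 / (0.01 * 0.01),    # 1% of residents live on Main St
    "street_number": 1000.0,                # "say" value from the text
    "state": 1.5,                           # "say" value from the text
}

odds = posterior_odds(prior, lrs.values())
print(f"posterior odds  = {odds:.3f}")       # roughly 0.3
print(f"posterior prob. = {odds_to_probability(odds):.3f}")
```

Note that even with four matching address features, the posterior odds stay below 1 because the prior is so small; the name fields would have to supply the remaining evidence.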

Q1:

We have two records as follows. Please estimate the likelihood ratio that these actually belong to the same person. You may assume any probabilities you need for the calculation, but please suggest a practical way to estimate them, using any resource available to the hospital or a small study. How you estimate these probabilities is at least as important as the likelihood ratio calculation itself. Assume independence between the variables and ignore the possibility of typos and errors.

Variables	Record1	Record2
First name	George	George
Last name	Bush	Bush
Gender	M	M
Date of birth	Jan 2, 1914	Jan. 2, 1941
Address	123 Main St.	123 Main St.
City	Midland	Midland
State	TX	TX

In reality, we must also consider the effect of typos and errors. For instance, the ZIP code 02113 can be miswritten as 02131, a common transcription error. Another possible problem is that the same person may use different first names (e.g., Dave and David) or may go by a middle name. Also, people sometimes put middle names or initials on the registration form and sometimes do not. Handling such mismatches can be challenging.
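As an illustration of how a transcription error like this could be recognized automatically, here is a minimal sketch that tests whether two strings differ by exactly one swap of adjacent characters. The function name is hypothetical, and this check is only one of many possible error models; it says nothing about how likely such an error is.

```python
def is_adjacent_transposition(a: str, b: str) -> bool:
    """Return True if b equals a with exactly one pair of adjacent characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    # Positions where the two strings disagree.
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


print(is_adjacent_transposition("02113", "02131"))  # the ZIP codes from the text
print(is_adjacent_transposition("02113", "02114"))  # a substitution, not a swap
```

The same check applies to any fixed-format field, such as the year in a date of birth.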

Q2:

Consider the possibility that the two dates of birth are actually the same and differ only because of a transposition error. Please re-estimate the likelihood ratio for the above two records. You may assume any probabilities you need for the calculation, but please suggest a practical way to estimate them, including the probability of a transposition error, using any resource available to the hospital or a small study. How you estimate these probabilities is at least as important as the likelihood ratio calculation itself. Assume independence between the variables.

The assumption of conditional independence among pairs of (mis)matching features is not really appropriate under some circumstances, no matter how convenient it may be. For example, different ethnic groups tend to have different last names (e.g., you might be more tempted to look for my ancestry among Central Europeans than among Chinese, Hawaiians, Welsh or Hispanics). But the distribution of first names often follows similar ethnic patterns. Therefore, if you compare two records each with the name “Raul Gonzales”, the likelihood ratio should almost certainly not be as high as the product of the likelihood ratios for “Raul” and for “Gonzales”. Intuitively, once I learn that two records have a Hispanic last name, a further match on a Hispanic first name should be less impressive than that same further match would be in conjunction with a Slavic last name (because that combination is much rarer). The Census Bureau (www.census.gov) does not, to my knowledge, publish statistics on name distributions in different ethnic groups or on the correlations between first and last names.

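If such distributions are available, however, we can make a first-order adjustment for such dependencies with a simple Bayesian model. Assuming that first and last names are conditionally independent given the ethnic group, we have $P(\text{first}, \text{last}) = \sum_{\text{ethnic}} P(\text{first} \mid \text{ethnic}) \, P(\text{last} \mid \text{ethnic}) \, P(\text{ethnic})$.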
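To see how much such an adjustment can matter, the sketch below computes the probability of a full name by summing over groups and compares it with the naive product of the marginal name frequencies. All numbers and the group labels "A" and "B" are hypothetical, chosen only to make the effect visible. The likelihood ratio for an exact match on a value with frequency p is taken to be p / p² = 1 / p, as in the city example earlier, ignoring typos.

```python
# Hypothetical figures for illustration only; these are not census data.
p_group = {"A": 0.10, "B": 0.90}            # P(group)
p_first = {"A": 0.05, "B": 0.001}           # P(first = "Raul" | group)
p_last = {"A": 0.04, "B": 0.0005}           # P(last = "Gonzales" | group)


def marginal(p_given_group, p_group):
    """P(name) = sum over groups of P(name | group) * P(group)."""
    return sum(p_given_group[g] * p_group[g] for g in p_group)


def joint(p_first, p_last, p_group):
    """P(first, last), assuming first and last are independent within each group."""
    return sum(p_first[g] * p_last[g] * p_group[g] for g in p_group)


p_first_marg = marginal(p_first, p_group)
p_last_marg = marginal(p_last, p_group)
p_joint = joint(p_first, p_last, p_group)

naive_lr = 1.0 / (p_first_marg * p_last_marg)   # treats the two names as independent
adjusted_lr = 1.0 / p_joint                     # accounts for the shared group

print(f"naive LR    = {naive_lr:,.0f}")
print(f"adjusted LR = {adjusted_lr:,.0f}")
```

With these made-up inputs the adjusted likelihood ratio is several times smaller than the naive one, which is the direction the “Raul Gonzales” argument predicts.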

In Partners EMPI, three groups of variables were used for patient matching:

– Last and First names

– First name, Date of Birth and Gender

– Social Security Number

Q3:

Are there any possible dependencies among these variables, besides that of first names and last names based on ethnicities? If so, please suggest a mathematical model to make a first-order adjustment.

The next two questions are based on an exercise of bioinformatics database usage. In the course of this exercise, it is important that you cite the bioinformatics databases you use and that you record the steps you perform to reach your answer.

A region of chromosome 7 was sequenced in a cohort of individuals (of unknown ethnicity) affected by pulmonary disease, resulting in the following DNA sequence:

TGGCTAACAAAACTAGGATTTTGGTCACTT[C/T]TAAAATGGAACATTTAAAGAAAGCTGACAA

Note that the central location exhibits a C to T polymorphism. In particular, 36% of the subjects presented the T allele of the SNP. By contrast, in a cohort of comparable size of healthy subjects, the frequency of the T allele was found to be at 1%.
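Most sequence search tools expect plain nucleotide strings rather than the bracketed [C/T] notation. The sketch below, with hypothetical helper names, expands the sequence into its two allele versions and also produces the reverse complement, which may be needed if the gene lies on the opposite strand. It does not query any database.

```python
import re

# Matches a single biallelic site written as [X/Y].
SNP_PATTERN = re.compile(r"\[([ACGT])/([ACGT])\]")


def expand_snp(seq: str):
    """Return (first-allele sequence, second-allele sequence, zero-based SNP offset)."""
    matches = list(SNP_PATTERN.finditer(seq))
    if len(matches) != 1:
        raise ValueError("expected exactly one [X/Y] site in the sequence")
    m = matches[0]
    left, right = seq[: m.start()], seq[m.end():]
    return left + m.group(1) + right, left + m.group(2) + right, len(left)


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


raw = "TGGCTAACAAAACTAGGATTTTGGTCACTT[C/T]TAAAATGGAACATTTAAAGAAAGCTGACAA"
c_allele, t_allele, offset = expand_snp(raw)
print(c_allele)
print(t_allele)
print("SNP offset (0-based):", offset)
print(reverse_complement(c_allele))
```

Either allele string can then be pasted into the search tool of your choice; remember to record which tool and database version you used.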

Q4:

Locate this sequence in the human genome, and specify the gene it belongs to.

Q5:

Describe the location of the above SNP in the gene, as precisely as possible. If it is a coding location, determine the possible functional effects of this mutation in terms of amino acid changes.
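If the SNP does turn out to fall in a coding region, the amino acid consequence depends on the codon it sits in and on the reading frame, both of which you must determine from the gene annotation. The sketch below only shows the mechanical last step, using the standard genetic code. The function name and the example codons are hypothetical and are not taken from the sequence above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(ref_codon: str, alt_codon: str):
    """Return (reference amino acid, variant amino acid, kind of change)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"      # creates a premature stop codon
    elif ref_aa == "*":
        kind = "stop-lost"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind


# Hypothetical example codons, unrelated to the problem's sequence.
print(classify_substitution("GAA", "GAG"))
print(classify_substitution("GAG", "GTG"))
print(classify_substitution("CGA", "TGA"))
```

If the gene is on the minus strand, take the reverse complement of the genomic sequence before reading off the codon.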