6.872/HST950 Problem Set 3
Handed out:
Due: Thursday, April 8, 2004
In an earlier lecture, we outlined a “theory of record
linkage” (the full paper is linked from our class schedule page) that tells us,
in principle, how to do probabilistic matching of various features of two
objects in order to decide whether they are likely to be the same object.
Briefly, the theory is as follows. I have interspersed questions for you to
answer with the description.
Given two purported objects (e.g., patients), o1 and o2, it is
either the case that o1=o2 or
that they are distinct individuals. For example, our records contain a
patient file for Raul
P. Szolovits of
Peter Szolovits of
Among all the observations we might make of o1 and o2, we
select a certain set of
features fi(o)
that we agree will be of interest. For example, we might choose last name,
first and middle names, street address, city, and ZIP code.
For each pair of features fi(o1), fi(o2), we can
compare the probability that one would
observe fi(o1), fi(o2) in
either of the two cases of step 1. For example, assuming that half the
hospital’s patient population have home addresses in . Further, if 1% of people in the city live on Main St, then
We may get an
additional likelihood ratio of 1000 (say) for the address, 123, and another
factor of, say, 1.5, for both states being MA. These are both estimates, and
answer the question of what fraction of all addresses are 123, or what fraction of individuals live in MA. If our initial database contains
records on 1M individuals, then we might argue that the a priori odds
are essentially
. If we assume conditional independence of each of the
feature pairs from each other e.g., if we believe that you are no more likely
to get
matching street numbers on
estimate is.
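As a minimal sketch of the calculation described above: under conditional independence, the posterior odds that two records refer to the same person are the prior odds multiplied by the likelihood ratio of each feature pair. The ratios 1000 and 1.5 come from the example above; the prior odds of 1 in 1,000,000 is only an assumption suggested by the 1M-record database, and the function names are hypothetical.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by per-feature likelihood ratios.

    Valid only if the feature pairs are conditionally independent.
    """
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds in favor of a match into a probability."""
    return odds / (1.0 + odds)


# Assumed prior: one true match among roughly 1M records.
prior = 1 / 1_000_000
# Likelihood ratios quoted in the text: street number 123, state MA.
ratios = [1000, 1.5]

odds = posterior_odds(prior, ratios)
print(f"posterior odds = {odds:.6f}, probability = {odds_to_probability(odds):.6f}")
```

With only these two features the odds remain far below 1, which is why further features (names, street, city) are needed before a match becomes plausible.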
Q1:
We have two
records as follows. Please estimate the likelihood ratio that these are
actually of the same person. You can assume any probabilities you need for the
calculation, but please suggest a practical way to estimate these
probabilities, from any resource available to the hospital, or from a small
study. How you estimate these probabilities is at least as important as the
likelihood ratio calculation itself. Assume independence between the variables
and ignore the possibility of typos and errors.
Variables | Record1 | Record2
First name | George | George
Last name | Bush | Bush
Gender | M | M
Date of birth | |
Address | |
City | |
State | TX | TX
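One possible way to obtain such probabilities, sketched here under the stated assumptions of no typos or errors, is to count how often a value occurs in the hospital's own patient registry. If a value has frequency p, two records of the same person agree on it with probability p, while two records of different people agree on it with probability p squared, so the likelihood ratio for agreement is 1/p. The function name and the toy registry below are hypothetical, not real data.

```python
from collections import Counter


def agreement_likelihood_ratio(value, registry_values):
    """Likelihood ratio for two records agreeing on `value`.

    Assumes no recording errors, so LR = p / p**2 = 1 / p, where p is the
    relative frequency of `value` in the registry.
    """
    counts = Counter(registry_values)
    if counts[value] == 0:
        raise ValueError(f"{value!r} does not occur in the registry")
    return len(registry_values) / counts[value]


# Toy registry of last names (made up for illustration only).
last_names = ["Bush"] * 2 + ["Smith"] * 10 + ["Nguyen"] * 8

print(agreement_likelihood_ratio("Bush", last_names))   # 2 of 20 records
print(agreement_likelihood_ratio("Smith", last_names))  # 10 of 20 records
```

Rare values give large ratios and common values give small ones, which is the behavior the theory calls for.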
In reality, we must also consider the effect of typos and
errors. For instance, a zip code 02113 can be miswritten as 02131, a common
transcription error. Another possible problem is that the same person may use
different first names, e.g. Dave and David, or people may be referred to by
their middle names. Also, people sometimes put middle names or initials on the
registration form and sometimes do not. Handling such mismatches can be
challenging.
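To make the transcription-error case concrete, here is a small sketch that checks whether two strings differ by exactly one swap of adjacent characters, as in 02113 versus 02131. The function name is hypothetical, and this covers only adjacent transpositions, not other kinds of typos.

```python
def is_adjacent_transposition(a, b):
    """Return True if b equals a with exactly one pair of adjacent characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    # Positions at which the two strings disagree.
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


print(is_adjacent_transposition("02113", "02131"))  # the ZIP code example from the text
print(is_adjacent_transposition("02113", "02114"))  # a substitution, not a transposition
```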
Q2:
Consider the
possibility that the two dates of birth are actually the same but differ
because of a transposition error. Please re-estimate the likelihood ratio of the
above two records. You can assume any probabilities you need for the
calculation, but please suggest a practical way to estimate these probabilities,
including the probability of transposition errors, from any resource available
to the hospital, or from a small study. How you estimate these probabilities is
at least as important as the likelihood ratio calculation itself. Assume
independence between the variables.
The assumption of conditional independence among pairs of (mis)matching features is not really appropriate under some
circumstances, no matter how convenient it may be. For example, different
ethnic groups tend to have different last names (e.g., you might be more
tempted to look for my ancestry among Central Europeans than Chinese,
Hawaiians, Welsh or Hispanics). But the distribution of first names often
follows similar ethnic patterns. Therefore, if you compare two records each
with the name “Raul Gonzales”, the likelihood ratio should almost certainly not
be as high as the product of the likelihood ratios for “Raul” and for
“Gonzales”. Intuitively, once I learn that two records have a Hispanic last
name, then a further match on a Hispanic first name should be less impressive
than that same further match would be in conjunction with a Slavic last name
(because that combination is much rarer). The Census Bureau (www.census.gov) does
not, to my knowledge, publish statistics on name distributions in different ethnic
groups or on the correlations between first and last names.
If such distributions are available, however, we can make a first-order adjustment
for such dependencies with a simple Bayesian model. Assuming that first and last
names are conditionally independent given the ethnic group, we have
P(first, last | ethnic) = P(first | ethnic) · P(last | ethnic).
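One way to read this adjustment, offered as a sketch rather than as the intended answer: if first and last names are independent within each ethnic group, the joint probability of a full name is the sum over groups of P(ethnic) · P(first | ethnic) · P(last | ethnic). The group labels and all probabilities below are invented purely for illustration, and the function names are hypothetical.

```python
def marginal(value, p_group, p_value_given_group):
    """P(value) = sum over groups of P(group) * P(value | group)."""
    return sum(p_group[g] * p_value_given_group[g].get(value, 0.0) for g in p_group)


def joint_name_probability(first, last, p_group, p_first_given_group, p_last_given_group):
    """P(first, last), assuming independence of names within each group."""
    return sum(
        p_group[g]
        * p_first_given_group[g].get(first, 0.0)
        * p_last_given_group[g].get(last, 0.0)
        for g in p_group
    )


# Invented numbers: two equally sized groups, both names occur only in group_a.
p_group = {"group_a": 0.5, "group_b": 0.5}
p_first = {"group_a": {"Raul": 0.1}, "group_b": {}}
p_last = {"group_a": {"Gonzales": 0.1}, "group_b": {}}

joint = joint_name_probability("Raul", "Gonzales", p_group, p_first, p_last)
naive = marginal("Raul", p_group, p_first) * marginal("Gonzales", p_group, p_last)

print(f"adjusted P(first, last) = {joint:.4f}, likelihood ratio ~ {1 / joint:.0f}")
print(f"naive product           = {naive:.4f}, likelihood ratio ~ {1 / naive:.0f}")
```

In this toy setting the naive product understates how common the full name is, so treating the two names as independent overstates the likelihood ratio, in line with the “Raul Gonzales” argument above.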
In Partners
EMPI, three groups of variables were used for patient matching:
– Last and First names
– First name, Date of Birth and Gender
– Social Security Number
Q3:
Are there any
possible dependencies among these variables, besides that of first names and
last names based on ethnicities? If so, please suggest a mathematical model to
make a first-order adjustment.
The next two questions are based on an exercise in using bioinformatics databases. In the course of this exercise, it is important that you cite the bioinformatics databases you use and that you record the steps you perform to reach your answer.
A region of chromosome 7 was sequenced in a cohort of individuals (of unknown ethnicity) affected by pulmonary disease, resulting in the following DNA sequence:
TGGCTAACAAAACTAGGATTTTGGTCACTT[C/T]TAAAATGGAACATTTAAAGAAAGCTGACAA
Note that the central location exhibits a C to T polymorphism. In particular, 36% of the subjects presented the T allele of the SNP. By contrast, in a cohort of comparable size of healthy subjects, the frequency of the T allele was found to be at 1%.
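Sequence search tools generally expect plain nucleotide strings, so it may help to expand the [C/T] notation into the two allele sequences before searching. The helper below is a hypothetical convenience sketch; it does not query any database.

```python
import re


def expand_snp(sequence):
    """Split 'LEFT[X/Y]RIGHT' into the two allele sequences and the 0-based SNP index."""
    match = re.fullmatch(r"([ACGT]*)\[([ACGT])/([ACGT])\]([ACGT]*)", sequence)
    if match is None:
        raise ValueError("expected exactly one SNP written as [X/Y]")
    left, first_allele, second_allele, right = match.groups()
    return left + first_allele + right, left + second_allele + right, len(left)


snp_sequence = "TGGCTAACAAAACTAGGATTTTGGTCACTT[C/T]TAAAATGGAACATTTAAAGAAAGCTGACAA"
c_allele, t_allele, position = expand_snp(snp_sequence)

print(c_allele)
print(t_allele)
print(position)
```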
Q4:
Locate this sequence
in the human genome, and specify the gene it belongs to.
Q5:
Describe the location
of the above SNP in the gene, as precisely as possible. If it is a coding
location, determine the possible functional effects of this mutation in terms
of amino acid changes.