6.872/HST950 Sample Projects
Here is a set of example topics for the 6.872 final project. Some of
these were final projects in prior years.
- Calculate which documents in PubMed (see ncbi.nlm.nih.gov) are the closest
to a given patient's context based on a problem list. You can use the problem
list from the Clinician's Workstation database (an available clinical database
for the class).
- Use the Gene Ontology nomenclature to cluster gene results in a microarray
experiment based on how close two genes are in the literature citing them.
- Translate the CWS (or other clinical database) relational database into an
XML stream and re-constitute the data in a better normalized data model.
Compare your effort and results when the "better" data model is a generic
model vs. one containing many distinct tables.
- Implement role-based access to the CWS (or other clinical database).
Implement cryptographic authentication. Describe the weaknesses of your
approach and how you would compromise privacy in your system if you were a
- How will patients understand the meaning of a polymorphism/mutation that
they were just "diagnosed" with? Create a web-enabled script that, given a
LocusLink ID, returns all the gene polymorphisms and OMIM citations associated
with that gene. Then automatically scour the entire web and PubMed to find
consumer-oriented (i.e., plain English) explanations of information that would
help a patient understand the meaning of a polymorphism/mutation of a
- Given a microarray data set, articulate all the sources of error in
measurement and find them in the particular data set. How will these errors
affect the conclusions of the paper based on these results? Can you estimate a
probability distribution of errors across the microarray surface? Find
microarrays in a data set which are outliers in this distribution. Illustrate
these analyses with two-dimensional surface plots/error surfaces.
- Calculate the distribution of the melting temperature of the
oligonucleotide probe sets that Affymetrix uses for its microarrays. Determine
the relationship between the measured variability of gene expression in one or
more data sets and the variation of the melting temperature within a probe
set. Can you identify a subset of probes (from the .cel files) that reduces
variability for a probe set? How does using ONLY those reduced-variability
probe sets affect the classification task for which the data set was
- Develop a set of tools to de-identify sensitive medical records by finding
and replacing data that appear to identify the patient, such as names,
nicknames, various identification numbers, addresses, phone numbers, etc.
- Use simple natural language techniques to extract interesting coded
aspects of the medical record from unstructured text. For example, find
all mentions of medications, dosages, routes of administration, etc.
Another example would be to try to determine by textual analysis whether the
description of a patient might be consistent with some specific disease of
public health interest (ranging from meningitis to anthrax).
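To give a sense of what "simple natural language techniques" might mean for
the medication example above, here is a minimal sketch based on a regular
expression. The drug and route word lists, the unit list, and the function
name extract_medications are all hypothetical placeholders chosen for
illustration; a real project would draw on a proper drug vocabulary and would
need to cope with misspellings, abbreviations, and negation.

```python
import re

# Hypothetical mini-lexicons for illustration only; a real project would
# load a full drug vocabulary and a richer list of routes and units.
DRUGS = ["aspirin", "metoprolol", "lisinopril", "amoxicillin"]
ROUTES = ["po", "iv", "im", "sc", "pr"]
UNITS = ["mg", "mcg", "g", "ml", "units"]

# A drug name, optionally followed by a dose with a unit, optionally
# followed by a route of administration.
PATTERN = re.compile(
    r"\b(?P<drug>" + "|".join(DRUGS) + r")\b"
    r"(?:\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>" + "|".join(UNITS) + r")\b)?"
    r"(?:\s+(?P<route>" + "|".join(ROUTES) + r")\b)?",
    re.IGNORECASE,
)


def extract_medications(text):
    """Return one dict per drug mention found in free text."""
    results = []
    for m in PATTERN.finditer(text):
        results.append({
            "drug": m.group("drug").lower(),
            "dose": float(m.group("dose")) if m.group("dose") else None,
            "unit": m.group("unit").lower() if m.group("unit") else None,
            "route": m.group("route").lower() if m.group("route") else None,
        })
    return results


if __name__ == "__main__":
    note = "Started metoprolol 25 mg PO bid; continue aspirin."
    for mention in extract_medications(note):
        print(mention)
```

A pattern like this only finds what its word lists anticipate, so part of
such a project would be measuring how many mentions it misses on real notes.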
This list is meant only to be suggestive. Any reasonable project of
roughly the above level of sophistication will be welcome. We ask you to
submit a proposal (due date to be determined) that describes the proposed project in
enough detail that we can critique it. It should also identify the group
of people (we suggest 2-3) who will work together on the project.
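As an illustration of the level of detail a proposal might work from, the
melting-temperature project in the list above could start from a sketch like
the following. It uses a common GC-content rule of thumb,
Tm = 64.9 + 41 * (G + C - 16.4) / N, which is only an approximation; it is an
assumption of this sketch, not a method prescribed by the course or by
Affymetrix, and a nearest-neighbor thermodynamic model would be more accurate.
The function names and the example sequences are hypothetical.

```python
from statistics import mean, pstdev


def melting_temp(seq):
    """Approximate melting temperature (deg C) of an oligonucleotide.

    Uses the GC-content rule of thumb Tm = 64.9 + 41 * (G + C - 16.4) / N.
    This is an assumed approximation, chosen here for simplicity.
    """
    seq = seq.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def probe_set_tm_summary(probes):
    """Summarize the spread of approximate Tm across one probe set."""
    temps = [melting_temp(p) for p in probes]
    return {
        "mean": mean(temps),
        "sd": pstdev(temps),
        "min": min(temps),
        "max": max(temps),
    }


if __name__ == "__main__":
    # Made-up 25-mers standing in for the probes of one probe set.
    example_probes = ["ACGT" * 6 + "A", "G" * 25, "AT" * 12 + "A"]
    print(probe_set_tm_summary(example_probes))
```

The per-probe-set spread computed this way could then be compared against
the measured expression variability for the same probe set.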