CDM Seminar Series 2003-04


An Exploratory Analysis on Nurses’ Health Study

Delin Shen
PhD Candidate, Clinical Decision Making Group
April 28, 2004
2pm, 32-250


Breast cancer is the most common cancer in women, and has been one of the most
intensively studied areas of medical genetics. Two relatively high-penetrance
genes, BRCA1 and BRCA2, have been identified, but they account for only about
half of the families with hereditary breast cancer. Other susceptibility gene
areas have been proposed and examined, but no definitive result has been
obtained.

Mutational or genotypic heterogeneity can explain some of the clinical
variability observed in single-gene diseases, but usually not all of it,
especially in complex traits; the remainder is probably due to modifier genes
and environmental contributors. Most previous research on breast cancer has
focused on only one or a few genes and has rarely considered the influence of
environmental factors. The lack of definitive results from past research
suggests that breast cancer is a complex trait. Systematically exploring the
susceptibility genes and the possible contributing environmental factors may
therefore be a good alternative to evaluating genes individually.

We propose an exploratory analysis of the Nurses’ Health Study that applies
machine learning techniques, in particular Bayesian networks, to a collection
of genotype and phenotype data with very detailed annotations. The goal is to
develop a systematic method for large-scale, automated modeling of the
interplay among genetic polymorphisms, environmental factors, and phenotypes.
The possible results of this research include:

• New medical findings
• A graphical representation of the interplay among genotypes, environmental factors, and phenotypes
• A new tool to facilitate the exploration of large cohort health studies
• Improvement and/or modifications in machine learning algorithms

We plan to apply Bayesian networks and other machine learning techniques to
the Nurses’ Health Study, a large cohort study with a longitudinal record of
more than twenty years. The data set is very comprehensive and gives a unique
opportunity to study modifier genes and environmental contributors, including
temporal changes. A Bayesian network is a graphical representation of the
dependency structure among variables and is therefore suitable for
constructing a landscape of the interplay among genotypes, environmental
factors, and phenotypes.
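
To make the idea of a dependency structure concrete, the following is a minimal
sketch of a three-node Bayesian network in Python, assuming a hypothetical
structure in which a genotype and an environmental exposure both influence a
disease phenotype. The variable names, the structure, and every probability are
invented for illustration only; none are taken from the Nurses’ Health Study or
from the method described in this talk.

```python
from itertools import product

# Hypothetical network: Genotype -> Disease <- Exposure.
# All numbers below are made up for illustration; they are NOT study results.
P_GENOTYPE = {"variant": 0.1, "wildtype": 0.9}
P_EXPOSURE = {"exposed": 0.3, "unexposed": 0.7}

# Conditional probability table: P(disease = True | genotype, exposure).
P_DISEASE = {
    ("variant", "exposed"): 0.40,
    ("variant", "unexposed"): 0.20,
    ("wildtype", "exposed"): 0.10,
    ("wildtype", "unexposed"): 0.05,
}


def joint(genotype, exposure, disease):
    """Joint probability, factorized along the edges of the network."""
    p_d = P_DISEASE[(genotype, exposure)]
    p_phenotype = p_d if disease else 1.0 - p_d
    return P_GENOTYPE[genotype] * P_EXPOSURE[exposure] * p_phenotype


def marginal_disease():
    """P(disease = True), summing the joint over genotype and exposure."""
    return sum(joint(g, e, True) for g, e in product(P_GENOTYPE, P_EXPOSURE))


def posterior_genotype_given_disease(genotype):
    """P(genotype | disease = True), by enumeration and Bayes' rule."""
    numerator = sum(joint(genotype, e, True) for e in P_EXPOSURE)
    return numerator / marginal_disease()
```

With these invented numbers, posterior_genotype_given_disease("variant") is
about 0.31, compared with a prior of 0.1, which shows how observing a phenotype
updates belief about a genotype through the network. Learning such structures
and tables from cohort data, rather than specifying them by hand, is a separate
and much harder problem that this sketch does not address.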

