HST.950/6.872 Problem Set 4

Due 10/7/2004


1. In the "relevance networks" (RN) paper, the expression baselines of 7245 genes and 5084 anticancer agents were combined into one data set based on 60 cancer cell lines. Suppose we apply the self-organizing maps (SOM) method to this data set for the same analytical purpose. Answer the following questions:

a. How many dimensions will the space be? How many data points do we have?

b. In the SOM paper, two important data preprocessing procedures were introduced: a variation filter to eliminate genes that did not change significantly across samples, and normalization of expression levels. These procedures were not applied in the RN paper. To get results similar to those of the RN method, would you apply them? Why?


2. Suppose we apply the RN method to the data set in the SOM paper, which contains the expression patterns of some 6000 genes. If we gradually reduce the threshold from 1 until the first one or few relevance networks appear, which cluster(s) in Fig. 2a of the SOM paper will include most of the genes in these relevance networks? Why?


3. Discuss the major differences between the RN and SOM methods.


4. Sickle cell anemia is an autosomal recessive disorder caused by a defect in the HBB gene, which codes for the beta-globin subunit of hemoglobin. In the United States, it affects around 72,000 people, most of whose ancestors come from sub-Saharan Africa. The disease occurs in about 1 in every 500 African-American births. What is the proportion of African Americans carrying the mutant allele?
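One possible way to set up the arithmetic for Problem 4 is sketched below. It assumes Hardy-Weinberg equilibrium, which the problem does not state explicitly, and the function name `hardy_weinberg` is a hypothetical helper introduced only for illustration. Under that assumption the frequency of affected births equals q^2, where q is the frequency of the mutant allele.

```python
import math

def hardy_weinberg(affected_fraction):
    """Genotype frequencies for a recessive disorder.

    Assumes Hardy-Weinberg equilibrium, so that the fraction of
    affected (homozygous mutant) births equals q**2.
    """
    q = math.sqrt(affected_fraction)   # mutant allele frequency
    p = 1.0 - q                        # normal allele frequency
    return {
        "q": q,
        "heterozygous": 2 * p * q,     # unaffected carriers
        "homozygous": q * q,           # affected individuals
        "any_copy": 1.0 - p * p,       # at least one mutant allele
    }

freqs = hardy_weinberg(1 / 500)
for name, value in freqs.items():
    print(f"{name}: {value:.4f}")
```

Whether "carrying the mutant allele" means heterozygous carriers only or anyone with at least one copy is a matter of interpretation, so the sketch reports both quantities.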


5. In a genomic study, we have recruited 10 individuals and genotyped two consecutive loci. The alleles of the resulting 20 chromosomes are listed in the following table. Chromosomes with ID 1 and 2 are from individual 1, chromosomes with ID 3 and 4 are from individual 2, and so on. Compute the degree of linkage disequilibrium between the two loci.

ID Locus1 Locus2
1 A G
2 A G
3 A G
4 T T
5 A G
6 T G
7 A G
8 A G
9 T T
10 T T
11 A G
12 T G
13 A G
14 A T
15 A G
16 T T
17 A G
18 A G
19 T G
20 T T
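
The problem does not say which measure of linkage disequilibrium is intended. The sketch below computes three common ones (D, D' and r^2) from the 20 chromosomes in the table, treating each row as a phased haplotype and each locus as biallelic. The function name `linkage_disequilibrium` and the choice of A and G as reference alleles are assumptions made for illustration.

```python
from collections import Counter

# Haplotypes for chromosome IDs 1..20, copied from the table above
# (first letter: allele at locus 1, second letter: allele at locus 2).
haplotypes = ("AG AG AG TT AG TG AG AG TT TT "
              "AG TG AG AT AG TT AG AG TG TT").split()

def linkage_disequilibrium(haps, allele1, allele2):
    """Return (D, D', r^2) for two biallelic loci from phased haplotypes."""
    n = len(haps)
    counts = Counter(haps)
    p1 = sum(h[0] == allele1 for h in haps) / n   # allele freq at locus 1
    p2 = sum(h[1] == allele2 for h in haps) / n   # allele freq at locus 2
    p12 = counts[allele1 + allele2] / n           # haplotype frequency
    d = p12 - p1 * p2
    # D' normalizes D by its largest attainable magnitude given p1 and p2.
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

d, d_prime, r2 = linkage_disequilibrium(haplotypes, "A", "G")
print(f"D = {d:.4f}, D' = {d_prime:.4f}, r^2 = {r2:.4f}")
```

Choosing the other allele at either locus as the reference flips the sign of D but leaves its magnitude and r^2 unchanged.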