
Information Theoretic Approaches to Identifying Informative Subsets of Biological Data
Introduction and Motivation
Understanding how biological systems process and respond to
environmental cues requires the simultaneous examination of many
different species in a multivariate fashion. High-throughput data
collection methods promise to deliver this type of multivariate data;
however, analyzing the resulting high-dimensional datasets to identify
which of the measured species are most important in mediating
particular outcomes remains a challenging task. Additionally, when
designing subsequent experiments, existing datasets may be useful in
selecting the measurements, conditions, and time points that best
capture the relevant aspects of the full set.
Information theory offers a framework for identifying the most
informative subsets of existing datasets, but the application to
biological systems can be difficult when dealing with relatively few
data samples. The goal of this project is to develop and validate a
method for approximating high-order information theoretic statistics
using associated low-order statistics that can be more reliably
estimated from the data. Using this technique, we aim to identify
subsets of biological datasets that are maximally informative of the
full system behavior or of defined outputs according to information
theoretic metrics.
Information Theory Framework
Information theory provides a natural framework for identifying
sets of features that have significant statistical relationships with
each other or with external variables. The two fundamental concepts
from the theory are information entropy and mutual information (MI)
which quantify statistical uncertainty and statistical dependency,
respectively. Though similar to the correlation-based statistics
variance and covariance, the information theoretic statistics
have important advantages. In addition to being invariant to
reversible transformations and able to capture nonlinear relationships,
information theoretic statistics can be applied to categorical
variables (such as the classification of a particular tumor) as well
as continuous ones (such as the expression level of a gene from a
microarray). These statistics can also be extended to quantify the
relationships of sets of features using joint entropy and
multi-information.
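To make these two concepts concrete, both entropy and mutual information can be estimated for discrete data with simple plug-in (empirical-frequency) estimators. The sketch below is illustrative only and is not the estimator used in the project:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Plug-in Shannon entropy H(X) in bits from a list of samples."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# A perfectly dependent pair shares exactly H(X) bits of information.
x = [0, 0, 1, 1] * 25
print(entropy(x))                # 1.0
print(mutual_information(x, x))  # 1.0
```

Because the samples can be any hashable values, the same code handles categorical labels (e.g. tumor classes) and discretized continuous measurements (e.g. binned expression levels) alike.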
High-Order Entropy Approximation
At the core of the project is a method for approximating high-order
information theoretic statistics, such as information entropy and
mutual information, from associated low-order terms. Estimating
high-order entropies directly from sparsely sampled data is extremely
unreliable, whereas second- and third-order entropies can be estimated
reliably. In many biological systems, high-order interactions may be
rare, meaning that a model incorporating only low-order information
can often perform reasonably well.
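One common form of such a low-order approximation truncates the inclusion-exclusion expansion of the joint entropy after the pairwise terms, H(X1,...,Xd) ≈ Σ_{i&lt;j} H(Xi,Xj) − (d−2) Σ_i H(Xi). The sketch below assumes this particular truncation, which may differ from the project's exact scheme:

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def plug_in_entropy(rows):
    """Plug-in joint entropy in bits; rows is a list of hashable samples."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def second_order_entropy(samples, d):
    """Pairwise truncation: sum_{i<j} H(Xi,Xj) - (d-2) * sum_i H(Xi)."""
    singles = sum(plug_in_entropy([row[i] for row in samples])
                  for i in range(d))
    pairs = sum(plug_in_entropy([(row[i], row[j]) for row in samples])
                for i, j in combinations(range(d), 2))
    return pairs - (d - 2) * singles

# Three independent fair bits: the true joint entropy is 3 bits, and the
# pairwise truncation recovers it exactly, since there are no
# higher-order interactions for it to miss.
samples = list(product([0, 1], repeat=3))  # all 8 patterns, equally weighted
print(second_order_entropy(samples, 3))    # 3.0
```

The truncation is exact whenever all interactions are at most pairwise, which is why it works well when high-order interactions are rare.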
To assess the performance of the various levels of approximation in
different sampling regimes, we explored a series of randomly generated
relational networks, spanning widely varying topologies and orders of
influence, whose entropies are analytically computable. Using our
approximation framework, we have shown that for the sampling regimes
typical of biological systems, the low-order approximations
significantly outperform direct estimation.
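The flavor of this comparison can be reproduced in miniature. For independent fair bits the true joint entropy is known analytically, while a direct plug-in estimate from n samples can never exceed log2(n) bits and therefore saturates far below the truth. The toy simulation below (our own stand-in, not the relational networks used in the project) contrasts that with a pairwise-truncation approximation:

```python
import random
from collections import Counter
from itertools import combinations
from math import log2

def plug_in_entropy(rows):
    """Plug-in joint entropy in bits from a list of hashable samples."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def second_order_entropy(samples, d):
    """Pairwise truncation: sum_{i<j} H(Xi,Xj) - (d-2) * sum_i H(Xi)."""
    singles = sum(plug_in_entropy([row[i] for row in samples])
                  for i in range(d))
    pairs = sum(plug_in_entropy([(row[i], row[j]) for row in samples])
                for i, j in combinations(range(d), 2))
    return pairs - (d - 2) * singles

random.seed(0)
d, n = 10, 50
# Ten independent fair bits: true joint entropy is exactly 10 bits, but
# the direct estimate from 50 samples is capped at log2(50) ~ 5.6 bits.
samples = [tuple(random.randint(0, 1) for _ in range(d)) for _ in range(n)]
direct = plug_in_entropy(samples)
approx = second_order_entropy(samples, d)
```

In this regime the pairwise approximation lands near the true 10 bits while direct estimation cannot, illustrating why the low-order approximations win when samples are scarce.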
Applications to Biological Data
We aim to apply these approximation techniques to identify subsets
of high-dimensional biological datasets that are most informative of
system outputs. We have begun projects involving a variety of types of
data and for a range of applications including:
- identifying informative subsets of gene expression levels from
  microarrays for cancer classification
- identifying characteristic time-point sets in the collection of
  multivariate protein signalling data
- identifying informative sets of parameters in biological models
- identifying sets of sequence positions that are most informative
  about properties such as binding affinity, using sequence alignments
While the specific applications are broad in scope, they all focus
on identifying biological subsets that are statistically informative,
with an emphasis on representing the high-dimensional data more
compactly and on better understanding the biological systems
underlying the data.
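A minimal sketch of the subset-selection idea: greedily add the feature whose inclusion most increases the joint mutual information with the output. The greedy strategy and the toy data here are our own illustration, not the project's algorithm:

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Plug-in entropy in bits from a list of hashable samples."""
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def joint_mi(feature_cols, y):
    """I(S;Y) = H(S) + H(Y) - H(S,Y) for a set S of feature columns."""
    s = list(zip(*feature_cols))
    return entropy(s) + entropy(list(y)) - entropy(list(zip(s, y)))

def greedy_select(features, y, k):
    """Pick k feature indices, each maximizing joint MI with the output."""
    chosen, remaining = [], list(range(len(features)))
    for _ in range(k):
        best = max(remaining, key=lambda j: joint_mi(
            [features[i] for i in chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: the output copies feature 0; features 1 and 2 carry no
# information about it, so greedy selection picks index 0 first.
f = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]]
y = [0, 0, 1, 1]
print(greedy_select(f, y, 1))  # [0]
```

The same scoring applies unchanged whether the features are expression levels, time points, model parameters, or sequence positions, since only the empirical joint frequencies matter.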
Figure 1. To evaluate the approximation framework, we simulated
100 randomly generated networks with analytically computable joint
entropies and applied the metrics across a range of sample sizes. When
the entropies are known exactly (top left), the higher-order
approximations perform increasingly well. When the entropies are
estimated from a finite sample, however (lower row), the
approximations provide the best estimates. This behavior is quantified
by computing the sum of squared errors of each metric as a function of
the sampling regime (top right). The best approximation to use depends
upon the amount of data available, but in all cases examined with
finite sample size, the approximations outperform direct estimation.

