The Tidor Lab

Information Theoretic Approaches to Identifying Informative Subsets of Biological Data

Introduction and Motivation

Understanding how biological systems process and respond to environmental cues requires examining many different species simultaneously, in a multivariate fashion. High-throughput data collection methods promise to provide this type of multivariate data; however, analyzing these high-dimensional datasets to identify which of the measured species are most important in mediating particular outcomes remains a challenging task. Additionally, when designing subsequent experiments, existing datasets may be useful in selecting the measurements, conditions, and time points that best capture the relevant aspects of the full set.

Information theory offers a framework for identifying the most informative subsets of existing datasets, but the application to biological systems can be difficult when dealing with relatively few data samples. The goal of this project is to develop and validate a method for approximating high-order information theoretic statistics using associated low-order statistics that can be more reliably estimated from the data. Using this technique, we aim to identify subsets of biological datasets that are maximally informative of the full system behavior or of defined outputs according to information theoretic metrics.

Information Theory Framework

Information theory provides a natural framework for identifying sets of features that have significant statistical relationships with each other or with external variables. The two fundamental concepts from the theory are information entropy and mutual information (MI), which quantify statistical uncertainty and statistical dependency, respectively. Though similar to the correlation-based statistics variance and covariance, the information theoretic statistics have important advantages. In addition to being invariant to reversible transformations and able to capture nonlinear relationships, information theoretic statistics can be applied to categorical variables (such as the classification of a particular tumor) as well as continuous ones (such as the expression level of a gene from a microarray). These statistics can also be extended to quantify the relationships among sets of features using joint entropy and multi-information.
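As a rough illustration of these quantities, the following minimal sketch computes plug-in (frequency-based) estimates of entropy, joint entropy, mutual information, and multi-information for discrete data. The function names and the choice of the plug-in estimator are assumptions made for this example only; they are not taken from the project's own code.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def joint_entropy(*columns):
    """Joint entropy of equally long columns, treating each row as one symbol."""
    return entropy(list(zip(*columns)))


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)


def multi_information(*columns):
    """Sum of the marginal entropies minus the joint entropy."""
    return sum(entropy(c) for c in columns) - joint_entropy(*columns)


# Illustrative data: relabelling a variable is a reversible transformation,
# so the mutual information with the relabelled copy equals entropy(x).
x = [0, 1, 0, 1]
relabelled = ["low" if v == 0 else "high" for v in x]
print(entropy(x), mutual_information(x, relabelled))
```

Because the estimates depend only on how often each symbol occurs, the same functions apply to categorical labels and to continuous measurements that have been binned beforehand.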

High-Order Entropy Approximation

At the core of the project is a method for approximating high-order information theoretic statistics such as information entropy and mutual information from associated low-order terms. Estimating high-order entropies directly from sparsely sampled data is extremely unreliable, whereas estimating second- or third-order entropies reliably has been shown to be possible. In many biological systems, high-order interactions may be rare, meaning that a model incorporating only low-order information can often perform reasonably well.
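The text does not specify the form of the approximation, so the sketch below substitutes one well-known second-order scheme purely for illustration: a Chow-Liu-style tree approximation, in which the joint entropy is approximated by the sum of the single-variable entropies minus the pairwise mutual information along a maximum spanning tree. The function name `tree_entropy_approximation` and the use of plug-in estimates are assumptions of this example, not the lab's method.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(symbols):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def tree_entropy_approximation(rows):
    """Approximate H(X_1..X_d) using only first- and second-order entropies.

    rows: list of equally long tuples, one tuple per sample.
    Returns sum_i H(X_i) minus the total pairwise MI along a maximum
    spanning tree over the variables (a Chow-Liu-style approximation).
    """
    d = len(rows[0])
    cols = [[row[i] for row in rows] for i in range(d)]
    h1 = [entropy(c) for c in cols]

    # Pairwise mutual information for every pair of variables.
    mi = {}
    for i, j in combinations(range(d), 2):
        mi[(i, j)] = h1[i] + h1[j] - entropy(list(zip(cols[i], cols[j])))

    # Prim's algorithm: grow a maximum spanning tree starting from variable 0.
    in_tree = {0}
    tree_mi = 0.0
    while len(in_tree) < d:
        weight, new_node = max(
            (mi[(min(i, j), max(i, j))], j)
            for i in in_tree
            for j in range(d)
            if j not in in_tree
        )
        tree_mi += weight
        in_tree.add(new_node)

    return sum(h1) - tree_mi


# Illustrative data: the second variable copies the first, the third is independent.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(tree_entropy_approximation(rows), entropy(rows))
```

For a system whose dependencies really are pairwise and tree-structured, as in this toy example, the approximation matches the joint entropy; when higher-order interactions are present it will generally differ.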

To assess the performance of the various levels of approximation in different sampling regimes, we explored a series of randomly generated relational networks that have analytically computable entropies and widely varying topologies and orders of influence. Using our approximation framework, we have shown that for the sampling regimes typical of biological systems, the low-order approximations significantly outperform direct estimation.
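The kind of comparison described above can be sketched on a much simpler synthetic system than the networks used in the project. The example below assumes a chain of binary variables in which each variable copies its predecessor and flips with a fixed probability, so the joint entropy is known in closed form. It then compares the squared error of direct estimation with that of a second-order estimate built from adjacent pairs. The chain model, the parameter values, and all function names are assumptions of this illustration.

```python
import random
from collections import Counter
from math import log2


def entropy(symbols):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def sample_chain(n_samples, d, flip, rng):
    """Draw samples from a binary chain: X1 is a fair coin and each later
    variable copies its predecessor, flipping with probability `flip`."""
    rows = []
    for _ in range(n_samples):
        row = [rng.randint(0, 1)]
        for _ in range(d - 1):
            row.append(row[-1] ^ (rng.random() < flip))
        rows.append(tuple(row))
    return rows


def exact_chain_entropy(d, flip):
    """Analytic joint entropy: 1 bit for X1 plus (d - 1) binary flip entropies."""
    h_flip = -flip * log2(flip) - (1 - flip) * log2(1 - flip)
    return 1.0 + (d - 1) * h_flip


def pairwise_chain_estimate(rows):
    """H(X1) + sum_i [H(X_i, X_i+1) - H(X_i)], using only low-order terms."""
    cols = list(zip(*rows))
    total = entropy(cols[0])
    for i in range(len(cols) - 1):
        total += entropy(list(zip(cols[i], cols[i + 1]))) - entropy(cols[i])
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
d, flip = 16, 0.2
truth = exact_chain_entropy(d, flip)
errors = {}
for n in (25, 100, 400):
    rows = sample_chain(n, d, flip, rng)
    direct_sq_err = (entropy(rows) - truth) ** 2
    pairwise_sq_err = (pairwise_chain_estimate(rows) - truth) ** 2
    errors[n] = (direct_sq_err, pairwise_sq_err)
    print(n, direct_sq_err, pairwise_sq_err)
```

The direct plug-in estimate can never exceed the base-2 logarithm of the number of samples, so with 16 variables and a few hundred samples it necessarily falls well short of the true entropy, whereas the pairwise estimate only requires reliable statistics for two variables at a time.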

Applications to Biological Data

We aim to apply these approximation techniques to identify subsets of high-dimensional biological datasets that are most informative of system outputs. We have begun projects involving a variety of types of data and for a range of applications including:

identifying informative subsets of gene expression levels from microarrays for cancer classification
identifying characteristic sets of time points in the collection of multivariate protein signalling data
identifying informative sets of parameters in biological models
identifying sets of sequence positions that are most informative about properties such as binding affinity, using sequence alignments

While the specific applications are broad in scope, they all focus on identifying biological subsets that are statistically informative, with an emphasis on better representing the high-dimensional data in a more compact way and better understanding the biological systems underlying the data.
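One simple way to search for an informative subset, shown here only as a sketch, is greedy forward selection: repeatedly add the feature that most increases the mutual information between the selected set and the output. The source does not state which search strategy the project uses, so the greedy procedure, the function names, and the toy feature names (`gene_a`, `gene_b`, `gene_c`) are all assumptions. The set-level mutual information is estimated directly here, which is only workable for small subsets; it is the high-order quantity that a low-order approximation would stand in for.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def set_mutual_information(feature_columns, output):
    """I(S;Y) = H(S) + H(Y) - H(S,Y) for a list of feature columns S."""
    joint = list(zip(*feature_columns))
    return entropy(joint) + entropy(output) - entropy(list(zip(joint, output)))


def greedy_select(columns, output, k):
    """Pick k feature names greedily by mutual information with the output.

    columns: dict mapping a feature name to its list of discrete values.
    """
    selected = []
    remaining = sorted(columns)  # sorted so ties are broken consistently
    for _ in range(k):
        best = max(
            remaining,
            key=lambda name: set_mutual_information(
                [columns[s] for s in selected + [name]], output
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: the output is determined by gene_a and gene_b together,
# while gene_c carries no information about it.
columns = {
    "gene_a": [0, 0, 1, 1, 0, 0, 1, 1],
    "gene_b": [0, 1, 0, 1, 0, 1, 0, 1],
    "gene_c": [0, 0, 0, 0, 1, 1, 1, 1],
}
output = [2 * a + b for a, b in zip(columns["gene_a"], columns["gene_b"])]
selected = greedy_select(columns, output, 2)
print(selected)
```

Greedy selection is not guaranteed to find the best subset, since a feature that is uninformative on its own can become informative in combination with others.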

Validation of approximation framework on synthetic systems

Figure 1. To evaluate the approximation framework, we simulated 100 randomly generated networks with analytically computable joint entropies and applied the metrics using a range of sample sizes. When the analytical entropies are known exactly (top left), the higher-order approximations perform increasingly well. When the entropies are estimated from a finite sample, however (lower row), the approximations provide the best estimates. This behavior is quantified by computing the sum of squared errors of each metric as a function of the sampling regime (top right). The best approximation to use depends upon the amount of data available, but for all cases examined with finite sample size, the approximations outperform direct estimation.