|
It was about 10:30 at night, and except for a small desk lamp and the
glow from my computer monitor, it was dark. The genome of the
photosynthetic bacterium Prochlorococcus marinus, responsible for
providing nearly 40% of the world's energy needs, had just been
sequenced. I loaded the draft version of the GenBank file into the
software pipeline I had spent the last two years developing. Minutes
later, a genome-scale reconstruction of the P. marinus metabolism
appeared on my screen. In that moment, I knew I was the first person
to view the entire metabolism of this cyanobacterium. I remember the
chills going down my spine, the palpitation of my heart, and the rush
that comes only from breathing the rarefied air of discovery.
The original motivation for this research came from a paper written in
1999 called "Toward Metabolic Phenomics: Analysis of Genomic Data
Using Flux Balances". In this paper, Schilling, Edwards, and Palsson
noted that, with the rapid completion of bacterial genomes, ORFeomes,
and proteomes, we are on the brink of having a complete
'part catalog' of many organisms. Based on that observation, they
predicted that our ability to understand the complex relationship
between genotype and phenotype will be limited "not by the data, but
by our tools to analyze and interpret this data." Finally, they
proposed a bioinformatics pipeline to automatically generate
metabolic flux models from an annotated genome, arguing that
a rigorous constraint-based analysis of these models would enable us to
iteratively refine our knowledge.
Inspired by this argument, I implemented their proposal for
the Church Lab at Harvard Medical School, enabling us to generate
experimentally verifiable flux predictions based on different
hypotheses for bacterial growth. This work is described in a paper we
published with the ungainly title of "From annotated genomes to
metabolic flux models and kinetic parameter fitting." From this
experience, I learned several surprising and valuable lessons.
Lesson 1: "A good representation is the key to good problem solving"
--Patrick Winston
Although these words were said in the context of problems in
Artificial Intelligence, the principle applies directly to the problem
of mapping genome annotations to metabolic flux models. Such a mapping
requires a rich ontology capable of representing the subtle
relationships between genes, proteins, enzymes, biochemical reactions,
and metabolic pathways. Using the representation underlying SRI's
BioCyc database, I was able to develop a bioinformatics pipeline to
generate metabolic flux models directly from an annotated genome,
perform consistency checks on the data using their powerful query
language, and represent the metabolic flux models in a form that could
be analyzed using Flux Balance Analysis (FBA) and Minimization of
Metabolic Adjustment (MOMA).
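The mapping described above can be sketched in a few lines of Python.
This is only an illustrative toy, not the pipeline itself: the gene
names, EC numbers, reactions, and helper functions are all
hypothetical, and it does not use the BioCyc representation or query
language. It shows how gene annotations select reactions, how those
reactions form a stoichiometric matrix S, and two simple consistency
checks: finding dead-end metabolites, and testing whether a flux
vector v satisfies the steady-state condition S v = 0 that FBA imposes.

```python
# Hypothetical annotation: gene -> EC number (illustrative values only).
GENE_TO_EC = {
    "geneA": "1.1.1.1",
    "geneB": "2.2.2.2",
}

# Hypothetical reaction table: EC number -> (reaction id, stoichiometry).
# Negative coefficients are consumed, positive coefficients are produced.
EC_TO_REACTION = {
    "1.1.1.1": ("R1", {"A": -1, "B": 1}),
    "2.2.2.2": ("R2", {"B": -1, "C": 1}),
}

# Hypothetical exchange reactions connecting the cell to its environment.
EXCHANGES = {
    "EX_A": {"A": 1},   # uptake of A
    "EX_C": {"C": -1},  # secretion of C
}


def build_model(gene_to_ec, ec_to_reaction, exchanges):
    """Collect the reactions implied by the annotated genes."""
    reactions = dict(exchanges)
    for ec in gene_to_ec.values():
        if ec in ec_to_reaction:
            reaction_id, stoichiometry = ec_to_reaction[ec]
            reactions[reaction_id] = stoichiometry
    return reactions


def stoichiometric_matrix(reactions):
    """Return (metabolites, reaction ids, S) with one row per metabolite."""
    metabolites = sorted({m for s in reactions.values() for m in s})
    reaction_ids = sorted(reactions)
    matrix = [
        [reactions[r].get(m, 0) for r in reaction_ids] for m in metabolites
    ]
    return metabolites, reaction_ids, matrix


def dead_end_metabolites(reactions):
    """Metabolites that are only produced or only consumed."""
    produced, consumed = set(), set()
    for stoichiometry in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            if coefficient > 0:
                produced.add(metabolite)
            elif coefficient < 0:
                consumed.add(metabolite)
    return sorted(produced ^ consumed)


def is_steady_state(matrix, fluxes, tolerance=1e-9):
    """Check S v = 0, with fluxes ordered like the matrix columns."""
    return all(
        abs(sum(c * v for c, v in zip(row, fluxes))) <= tolerance
        for row in matrix
    )
```

FBA would then search among the steady-state flux vectors for one that
maximizes an objective such as growth. That step needs a linear
programming solver and is omitted here.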
Lesson 2: "Standard is better than best" --Gerald J. Sussman
Because license restrictions on the BioCyc database prevented
me from publishing most of the models I generated, I decided to
collaborate with SRI to develop an open standard for the
representation of metabolic pathways called BioPAX. Because we plan to
develop open source semantic web technologies to infer metabolic flux
models from annotated genomes, aggregate pathways from multiple data
sources, and perform consistency checks on the pathway data, we
decided to use the W3C-recommended Web Ontology Language (OWL) to
represent the BioPAX ontology. According to the Pathway resource list
at http://biopax.org, over 150 biological pathway databases currently
exist. However, to consolidate all this knowledge for a particular
organism, it is necessary to extract the pathways from each database,
transform each pathway into a standard data representation, and load
the data into a repository. As part of the BioPAX working
group, which developed the BioPAX ontology to facilitate this goal, I
now direct a community effort to extract, transform and load metabolic
pathways into BioPAX.
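The extract, transform, and load steps described above might look
roughly like the following sketch. Everything in it is an assumption
made for illustration: the two source record formats, the Pathway
class, and the in-memory dictionary standing in for a repository are
hypothetical, and the code does not use the actual BioPAX classes or
OWL.

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """Hypothetical common representation shared by all sources."""
    name: str
    organism: str
    reactions: set
    sources: set = field(default_factory=set)


def from_source_a(record):
    """Transform a hypothetical dict record with ';'-separated reactions."""
    return Pathway(
        name=record["pathway"],
        organism=record["species"],
        reactions=set(record["rxns"].split(";")),
        sources={"source_a"},
    )


def from_source_b(record):
    """Transform a hypothetical (name, organism, reactions) tuple."""
    name, organism, reactions = record
    return Pathway(
        name=name,
        organism=organism,
        reactions=set(reactions),
        sources={"source_b"},
    )


def load(repository, pathway):
    """Merge a pathway into the repository, keyed by name and organism."""
    key = (pathway.name.lower(), pathway.organism.lower())
    if key in repository:
        existing = repository[key]
        existing.reactions |= pathway.reactions
        existing.sources |= pathway.sources
    else:
        repository[key] = pathway
```

Recording the source of each merged pathway keeps its provenance
visible after consolidation.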
Lesson 3: "The great thing about standards is that there are so many
from which to choose" -- unknown
Until recently, the method for exchanging metabolic flux models was
a hodgepodge of spreadsheets, flat files, and binary images. It
was impossible to recover the information about how a model was
developed from this data, and in many cases, the semantic
interpretation of the model was open to question. Interoperability
consisted of writing converters between different kinds of metabolic
analysis tools, each of which expected a different format. To address
these issues, we adopted the Systems Biology Markup Language (SBML) to
standardize the representation of our models. By consolidating our
tool set around a standard, we could focus on simulations rather than
data manipulations. Furthermore, by using BioPAX metadata to annotate
SBML, each metabolic pathway can be traced back to the database from
which it came.
Lesson 4: "Six weeks in the laboratory can save you six minutes at the
computer" --Tom Knight
Of course, all these efforts accomplished was to shift the
rate-limiting step back to the generation of experimental data. At a
recent genome annotation meeting in Washington, D.C., Peter Karp showed
that nearly 40% of the known biochemical reactions have orphaned
enzymes. Even for E. coli, over one hundred enzymes responsible for
catalyzing known biochemical reactions have an unknown sequence. This
finding exposes the fragile state of the underlying genomic
infrastructure. It is simply not true that we have the "part catalog"
of many organisms in hand when nearly 40% of known biochemical
function space is inaccessible to sequence homology searches like
BLAST. What is needed is a call to arms to the biochemistry labs:
an all-hands-on-deck approach to filling the gaps in our knowledge. The
goal is straightforward: to make it truly possible to generate
complete and consistent models derived entirely from the genome.
Attaining this goal would finally bridge the gap between functional
genomics data and system models.
Lesson 5: "Above all, one must have a feeling for the
organism" --Barbara McClintock
In the Introductory Systems Biology course I am co-developing for
Harvard's Department of Molecular Cell Biology, we plan to show our
students a movie of a neutrophil chasing a bacterium, eventually
engulfing it. The goal of the course is to use modeling and
simulation to understand the complex regulatory mechanisms involved in
chemotaxis. By developing mathematical insights into the model, we
will teach our students how to develop intuitions about what is
necessary to model in detail and what is appropriate to abstract.
More importantly, by systematically and rigorously testing the
knowledge inferred by these models, we want them to gain a
feeling for the organism.
My longer-term career goals center on the representation,
integration, modeling and simulation of biochemical pathways to
elucidate the complex relationship between genotype and phenotype.
|