Jeremy Zucker: Statement of Purpose

It was about 10:30 at night, and except for a small desk lamp and the glow from my computer monitor, it was dark. The genome of the photosynthetic bacterium Prochlorococcus marinus, responsible for providing nearly 40% of the world's energy needs, had just been sequenced. I loaded the draft version of the GenBank file into the software pipeline I had spent the last two years developing. Minutes later, a genome-scale reconstruction of the P. marinus metabolism appeared on my screen. In that moment, I knew I was the first person to view the entire metabolism of this cyanobacterium. I remember the chills going down my spine, the pounding of my heart, and the rush that comes only from breathing the rarefied air of discovery.

The original motivation for this research came from a 1999 paper titled "Toward Metabolic Phenomics: Analysis of Genomic Data Using Flux Balances". In this paper, Schilling, Edwards and Palsson noted that, with the rapid completion of bacterial genomes, ORFeomes and proteomes, we are on the brink of having a complete 'part catalog' of many organisms. Based on that observation, they predicted that our ability to understand the complex relationship between genotype and phenotype will be limited "not by the data, but by our tools to analyze and interpret this data." Finally, they proposed a bioinformatics pipeline to automatically generate metabolic flux models from an annotated genome, arguing that a rigorous constraint-based analysis of these models would enable us to iteratively refine our knowledge.

Inspired by this argument, I implemented their proposal for the Church Lab at Harvard Medical School, enabling us to generate experimentally verifiable flux predictions based on different hypotheses for bacterial growth. This work is described in a paper we published with the ungainly title of "From annotated genomes to metabolic flux models and kinetic parameter fitting." From this experience, I learned several surprising and valuable lessons.

Lesson 1: "A good representation is the key to good problem solving" --Patrick Winston

Although these words were said in the context of problems in Artificial Intelligence, the principle applies directly to the problem of mapping genome annotations to metabolic flux models. Such a mapping requires a rich ontology capable of representing the subtle relationships between genes, proteins, enzymes, biochemical reactions, and metabolic pathways. Using the representation underlying SRI's BioCyc database, I was able to develop a bioinformatics pipeline to generate metabolic flux models directly from an annotated genome, perform consistency checks on the data using their powerful query language, and represent the metabolic flux models in a form that could be analyzed using Flux Balance Analysis (FBA) and Minimization of Metabolic Adjustment (MOMA).

Lesson 2: "Standard is better than best" --Gerald J. Sussman

Because license restrictions on the BioCyc database prevented me from publishing most of the models I generated, I decided to collaborate with SRI to develop BioPAX, an open standard for the representation of metabolic pathways. Because we plan to develop open source semantic web technologies to infer metabolic flux models from annotated genomes, aggregate pathways from multiple data sources, and perform consistency checks on the pathway data, we chose the W3C-recommended Web Ontology Language (OWL) to represent the BioPAX ontology. According to the Pathway resource list at http://biopax.org, over 150 biological pathway databases currently exist. To consolidate all this knowledge for a particular organism, however, it is necessary to extract the pathways from each database, transform each pathway into a standard data representation, and load the data into a repository. As part of the BioPAX working group, which developed the BioPAX ontology to facilitate this goal, I now direct a community effort to extract, transform and load metabolic pathways into BioPAX.

Lesson 3: "The great thing about standards is that there are so many from which to choose" --Unknown

Until recently, metabolic flux models were exchanged as a hodgepodge of spreadsheets, flat files, and binary images. It was impossible to recover from this data how a model had been developed, and in many cases the semantic interpretation of the model was open to question. Interoperability consisted of writing converters between different metabolic analysis tools, each of which expected a different format. To address these issues, we adopted the Systems Biology Markup Language (SBML) to standardize the representation of our models. By consolidating our tool set around a standard, we could focus on simulation rather than data manipulation. Furthermore, by using BioPAX metadata to annotate SBML, each metabolic pathway can be traced back to the database from which it came.

Lesson 4: "Six weeks in the laboratory can save you six minutes at the computer" --Tom Knight

Of course, the only thing these efforts accomplished was to shift the rate-limiting step back to the generation of experimental data. At a recent genome annotation meeting in Washington, D.C., Peter Karp showed that nearly 40% of the known biochemical reactions have orphaned enzymes. Even for E. coli, over one hundred enzymes responsible for catalyzing known biochemical reactions have an unknown sequence. This finding exposes the fragile state of the underlying genomic infrastructure. It is simply not true that we have the "part catalog" of many organisms in hand when nearly 40% of known biochemical function space is inaccessible to sequence homology searches like BLAST. What is needed is a call to arms to the biochemistry labs: an all-hands-on-deck approach to fill the gaps in our knowledge. The goal is straightforward: to make it truly possible to generate complete and consistent models derived entirely from the genome. Attaining this goal will finally bridge the gap between functional genomics data and system models.

Lesson 5: "Above all, one must have a feeling for the organism" --Barbara McClintock

In the Introductory Systems Biology course I am co-developing for Harvard's Department of Molecular Cell Biology, we plan to show our students a movie of a neutrophil chasing a bacterium and eventually engulfing it. The goal of the course is to use modeling and simulation to understand the complex regulatory mechanisms involved in chemotaxis. By developing mathematical insights into the model, we will teach our students how to develop intuitions about what must be modeled in detail and what can appropriately be abstracted. More importantly, by systematically and rigorously testing the knowledge inferred from these models, we want them to gain a feeling for the organism.

My longer-term career goals center on the representation, integration, modeling and simulation of biochemical pathways to elucidate the complex relationship between genotype and phenotype.