Follow the Atoms

Probably no development is doing more to change the face of science today than the rise of computation as an increasingly prominent tool of discovery, whether in the form of modeling, simulation, data-mining, or other sophisticated number-crunching. Whether researchers are analyzing the masses of data produced at the Large Hadron Collider in the quest for the Higgs boson, modeling turbulence in pursuit of improved jet engines, simulating Earth's climate, or designing new materials using advanced algorithms and high-performance computers, computation is playing an ever more central role in the sciences.

Scientists today are seeking to reengineer metabolic pathways (represented in the schematic diagram shown above), which involve multiple biochemical reactions.

Nowhere perhaps has the transformation been more dramatic than in biology—long considered a field where mathematics played a comparatively minor part. Today mathematics is a key driver of progress in biology research. Computation is essential to the high-throughput genomic sequencing that has become the cornerstone of advances in contemporary biology. That is because high-end computers and complex software are needed to assemble the short DNA "reads" produced by modern genomic sequencing machines—like so many small pieces of a grand puzzle—into coherent whole genomes. Increasingly, also, researchers seek to understand organisms through comparative computer-driven searches of data on thousands of biological samples, combing for similarities that can illuminate the relationship of structure to function. And high-end computers have even begun to be used with some success by researchers to model basic mechanisms of cellular metabolism.

Much of this activity has been driven by major advances in high-performance computing hardware. At the moment, for example, DOE's Oak Ridge and Argonne National Laboratories house the world's fastest and fourth-fastest supercomputers, respectively, both boasting petaflop processing speeds. But there has also been a general upsurge of interest in mathematical methods, even when the mathematics in question may not demand the processing power of a "TOP500" supercomputer.

An interesting example is recent work by a DOE Office of Science-supported team of researchers at the firm SRI International, led by computational biologist Peter D. Karp. Karp's team—including Mario Latendresse, Jeremiah P. Malerich, and Mike Travers—has devised a remarkably accurate mathematical method of mapping biochemical reactions at the atomic level. The approach, which relies on some elegant mathematics to simplify a complex problem, could lead to tools with potentially wide practical application. Many DOE-supported researchers (and indeed countless other researchers at universities and companies across the nation) are attempting to reengineer organisms—especially microbes—to produce certain chemical products, such as biofuels. This process, known as metabolic engineering, is central to the new effort to harness biology to solve problems in such areas as energy and the environment. Understanding more precisely how biochemical reactions unfold could provide researchers with important insights into how biological elements need to be precisely tuned to produce a desired result. Accurate atom mapping of biochemical reactions could therefore make a major contribution.

As its name implies, the purpose of atom mapping is to understand, on an atom-by-atom basis, how you get from point A to point B in a chemical or biochemical reaction, from a "reactant" molecule to a "product" molecule. Atoms get shifted around in the course of a reaction. In atom mapping, you essentially want to know which atom from the reactant ends up where in the product.

Photo courtesy of Peter D. Karp

Computational biologist Peter D. Karp led the team that developed an accurate method of mapping biochemical reactions atom by atom.

Traditional mathematical approaches to this problem have focused on molecular structure. In particular, algorithms have been designed to compare the structure of the reactant with the structure of the product and to find what is known as the "maximum common subgraph" (MCG) between them. MCG is a fancy term for the largest piece of structure that the two compounds share, a kind of structural "common denominator." Identifying the MCG is already a tricky exercise: finding the true, optimal MCG between two compounds is computationally intractable in general (the problem is NP-hard), so an exact answer is impractical for all but small molecules. What the algorithms tend to produce instead is an approximate MCG. It may not be the best structural common denominator, but it is thought to be adequate for the purposes at hand.

In identifying the MCG, the algorithm pinpoints what in the reaction has remained unchanged between the reactant and the product. Once that step is complete, the algorithm can look at the rest of the atoms to determine where they have gone in the course of the reaction.
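
To make the idea concrete, here is a minimal Python sketch of an exact maximum-common-subgraph search by brute force. It is an illustration only, not any of the algorithms referred to above: the function name, the toy molecules (heavy atoms only), and the choice not to require the common part to be connected are all assumptions made for this example. Exhaustive search of this kind is workable only for a handful of atoms, which is why practical tools settle for approximations.

```python
def max_common_subgraph(atoms_a, bonds_a, atoms_b, bonds_b):
    """Largest mapping {atom index in A: atom index in B} that preserves
    element labels and the presence or absence of bonds between mapped atoms.
    Exhaustive backtracking; hypothetical helper for illustration only."""
    set_a = {frozenset(b) for b in bonds_a}
    set_b = {frozenset(b) for b in bonds_b}
    best = {}

    def extend(i, mapping):
        nonlocal best
        # Prune: even mapping every remaining atom could not beat the best so far.
        if len(mapping) + (len(atoms_a) - i) <= len(best):
            return
        if i == len(atoms_a):
            best = dict(mapping)
            return
        for j in range(len(atoms_b)):
            if j in mapping.values() or atoms_a[i] != atoms_b[j]:
                continue
            # Bonds to already-mapped atoms must agree in both molecules.
            if all((frozenset((i, p)) in set_a) == (frozenset((j, q)) in set_b)
                   for p, q in mapping.items()):
                mapping[i] = j
                extend(i + 1, mapping)
                del mapping[i]
        extend(i + 1, mapping)  # also try leaving atom i unmapped

    extend(0, {})
    return best


# Toy reactant: chain C-C-O-C. Toy product: C-C-O plus a detached C.
reactant_atoms, reactant_bonds = ["C", "C", "O", "C"], [(0, 1), (1, 2), (2, 3)]
product_atoms, product_bonds = ["C", "C", "O", "C"], [(0, 1), (1, 2)]

common = max_common_subgraph(reactant_atoms, reactant_bonds,
                             product_atoms, product_bonds)
print(common)  # the C-C-O fragment is shared; the last C is left over
```

The atoms left out of the returned mapping (here, the final carbon) are the ones whose fate the second step of the structural approach would then have to work out.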

While this method does provide insights into how biochemical reactions might unfold, it has not proved to be particularly effective at generating specific, accurate atom mapping of particular reactions.

Recently, a team of Princeton researchers has proposed an alternative approach, relying on a mathematical method known as "linear programming." Linear programming is an optimization technique widely used in business and other fields. It enables you to take a complex process with multiple inputs and constraints and to determine the optimal mix of input quantities to produce a certain desired outcome—for example, what the best mix of raw materials might be to produce a certain product cost-effectively.
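
A small worked example may help show what a linear program looks like. The Python sketch below solves a two-variable raw-materials problem by checking every corner of the feasible region, which works because the optimum of a feasible linear program with a finite minimum lies at a corner. The solver function, the costs, and the constraints are all invented for illustration; real applications would normally hand the problem to an off-the-shelf solver rather than enumerate corners.

```python
from fractions import Fraction
from itertools import combinations


def solve_lp_2d(cost, constraints):
    """Minimize cost[0]*x + cost[1]*y subject to a*x + b*y <= c
    for each (a, b, c) in constraints (integer coefficients).
    Brute force over intersections of constraint boundaries; assumes the
    problem is feasible and the minimum is finite."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundaries do not meet in a single point
        # Cramer's rule for the intersection of the two boundary lines.
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        if all(a * x + b * y <= c for a, b, c in constraints):
            value = cost[0] * x + cost[1] * y
            if best is None or value < best[0]:
                best = (value, x, y)
    return best


# Hypothetical mix: x kg of material A at 2 per kg, y kg of material B at 3 per kg.
constraints = [
    (-1, -1, -10),  # x + y >= 10  (need at least 10 kg in total)
    (0, -1, -4),    # y >= 4       (at least 4 kg of B)
    (1, 0, 5),      # x <= 5       (only 5 kg of A available)
    (-1, 0, 0),     # x >= 0
]
result = solve_lp_2d((2, 3), constraints)
print(result)  # (total cost, kg of A, kg of B)
```

In this toy case the cheapest mix uses all 5 kg of the cheaper material A and makes up the remaining 5 kg with B, for a total cost of 25.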

The advantage of relying on linear programming is its widespread use and the resulting ready availability of off-the-shelf software for solving linear programming problems. But like the MCG approach, the Princeton team's method—embodied in publicly available software called DREAM—also relies heavily on a structural understanding of the reactant and product molecules.

Perhaps the key innovation of the SRI team has been largely to dispense with the traditional structural focus in their analysis. The effect has been to simplify the mathematics of the problem. Instead of focusing on molecular structure or shape per se, the SRI researchers focused on the inherent relative strength or weakness of individual chemical bonds. In analyzing biochemical reactions, the researchers were concerned with a limited set of atoms: carbon, oxygen, nitrogen, phosphorus, hydrogen, and sulfur. The researchers developed weightings for the chemical bonds between (almost) all possible pairs of these atoms. These weightings reflected the bonds' inherent readiness to break or to form (or conversely, the bonds' relative strength and stability). The researchers then developed linear programming equations using these weightings.

Simply by weighting the bonds according to their readiness to form or to break, the researchers were able to generate accurate atom-to-atom mapping of the reactions. (To speed processing, the researchers added shortcuts to handle a few very common reactions and also those reactions where a ring of atoms remained intact through the process, a common phenomenon in biochemistry. In addition, hydrogen-hydrogen bonds were excluded, since empirical data on these bonds were lacking in the database the researchers ultimately used to validate their approach.)
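
The principle of choosing the mapping that disturbs the "cheapest" bonds can be sketched in a few lines of Python. This is a toy illustration, not the SRI team's method: it tries every element-preserving assignment by brute force instead of solving a linear program, and the bond weights, function name, and toy reaction (hydrolysis of a C-O-P linkage, hydrogens omitted) are assumptions invented for the example.

```python
from itertools import permutations

# Hypothetical penalties for breaking or forming a bond between two elements.
# Lower means the bond changes more readily. These are NOT published values.
BOND_WEIGHT = {
    frozenset(("C", "O")): 3,
    frozenset(("O", "P")): 1,
}


def best_atom_mapping(r_atoms, r_bonds, p_atoms, p_bonds):
    """Return (cost, mapping) for the reactant-to-product atom mapping whose
    broken and formed bonds have the smallest total weight."""
    p_set = {frozenset(b) for b in p_bonds}
    best_cost, best_map = None, None
    n = len(r_atoms)
    for perm in permutations(range(n)):
        # Each atom must map to a product atom of the same element.
        if any(r_atoms[i] != p_atoms[perm[i]] for i in range(n)):
            continue
        # Reactant bonds, re-expressed in product atom numbering.
        mapped = {frozenset((perm[i], perm[j])) for i, j in r_bonds}
        # Symmetric difference: bonds that were broken or newly formed.
        changed = mapped ^ p_set
        cost = sum(BOND_WEIGHT[frozenset(p_atoms[k] for k in bond)]
                   for bond in changed)
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, dict(enumerate(perm))
    return best_cost, best_map


# Reactants: C0-O1-P2 plus a free oxygen O3 (standing in for water).
# Products:  C0-O1 and P2-O3.
r_atoms, r_bonds = ["C", "O", "P", "O"], [(0, 1), (1, 2)]
p_atoms, p_bonds = ["C", "O", "P", "O"], [(0, 1), (2, 3)]

cost, mapping = best_atom_mapping(r_atoms, r_bonds, p_atoms, p_bonds)
print(cost, mapping)
```

With these made-up weights, the cheaper explanation is that the O-P bond breaks and the incoming oxygen attaches to phosphorus (cost 2), so the original oxygen stays on the carbon. The alternative, breaking and re-forming a C-O bond, would cost 6.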

To test the relative effectiveness of their method, the SRI team drew on the world's largest database of already mapped biochemical reactions, a database maintained as part of the Kyoto Encyclopedia of Genes and Genomes (KEGG) based in Japan. They compared their mappings to 2,446 mapped biochemical reactions in what is known as the KEGG RPAIR database. Their computer-generated mappings were in error in just 22 cases, an error rate of less than 1 percent. For comparison, the researchers were able to check 1,709 mapped reactions from the KEGG RPAIR database against mappings produced by the DREAM software. A total of 249 DREAM mappings proved to be in error, an error rate of about 14.6 percent.
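
The two error rates follow directly from the counts quoted above, as this short Python check shows (the function name is simply a label chosen for the example):

```python
def error_rate_percent(errors, total):
    """Share of mappings that disagreed with the reference, in percent."""
    return 100 * errors / total


sri_rate = error_rate_percent(22, 2446)     # SRI method vs. KEGG RPAIR
dream_rate = error_rate_percent(249, 1709)  # DREAM vs. KEGG RPAIR

print(f"SRI: {sri_rate:.1f}%  DREAM: {dream_rate:.1f}%")
```

Note that the two rates were measured on different numbers of reactions (2,446 and 1,709), so they are best read as indicative rather than as a strict head-to-head comparison.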

The researchers are working on further refinements of their approach. But the work of the SRI team already represents a major step forward in atomic-level analysis of biochemical reactions, and an elegant illustration of the power of today's computational approaches to transform our understanding of biochemistry and biology.

—Patrick Glynn, DOE Office of Science, Patrick.Glynn@science.doe.gov

Funding

DOE Office of Science, Office of Biological and Environmental Research

Publication

Mario Latendresse, Jeremiah P. Malerich, Mike Travers, and Peter D. Karp, "Accurate Atom-Mapping Computation for Biochemical Reactions," Journal of Chemical Information and Modeling 52, 2970 (2012).

Related Links

SRI International, Bioinformatics Research Group

DOE Office of Science, Office of Biological and Environmental Research, Genomic Science Program