Genome Databases Get Faster, Bigger, Stronger

DOE Joint Genome Institute expands data and analytical tools.

Image used with permission from Markowitz, V. M., et al. “IMG 4 version of the integrated microbial genomes comparative analysis system.” Nucl. Acids Res. 42(D1), D560–D567 (2014).
Integrated Microbial Genomes (IMG), a data warehouse run by the Department of Energy’s Joint Genome Institute, provides tools for analyzing the structural and functional annotations of genomes in a comparative context. This screenshot shows some of the system’s capabilities for exploring RNA sequencing data.

The Science

The U.S. Department of Energy Joint Genome Institute (DOE JGI) maintains the Integrated Microbial Genomes (IMG) data warehouse, which contains a rich collection of genomes from all three domains of life. IMG/M provides a similar collection of partially assembled genome reads from microbial communities (metagenomes). Both databases have recently been upgraded to address the increase in genome sequences and provide more options for users.

The Impact

The swiftly growing number of genomes and metagenomes available for analysis continues to challenge DOE JGI’s data systems, which are cited in hundreds of publications and used by students learning genomics. Improvements in both systems have expanded their capacity and added new data analysis tools.

Summary

IMG was introduced in 2005. Since the last published report describing the system in 2012, both IMG and IMG/M have grown and improved. These enhancements are outlined in two reports in the Jan. 1, 2014, issue of Nucleic Acids Research. The late 2013 version of IMG contains over 16,000 genomic datasets with more than 42 million protein-coding genes. This is more than three times the number of genomes the system contained 2 years ago, and most of the genomes (nearly 12,000) are bacterial, archaeal, and eukaryotic. IMG also includes thousands of viral genomes and hundreds of genome fragments, along with plasmids that did not come from a specific microbial genome sequencing project. Also in late 2013, IMG/M contained 3,328 metagenomic datasets from 460 metagenomic studies, with more than 19.5 billion protein-coding genes.

Both systems feature sophisticated analysis tools for publicly available datasets. The latest version of IMG includes tools for recording and analyzing single-cell genomes, RNA sequencing data, and gene clusters coding for the synthesis of complex organic molecules (biosynthetic clusters). The databases are continually improved to keep pace with recent advances in genomics. Future enhancements for IMG will include incorporating data and analysis tools for pangenomes (core genes common to all individuals in a species, as well as variant genes that enable some individuals to adapt to different environments). Enhancements for IMG/M will include the addition of metaproteomic datasets (protein samples collected from environmental sources).

Contact

Nikos C. Kyrpides
DOE Joint Genome Institute
nckyrpides@lbl.gov

Funding

This research was funded by the Office of Biological and Environmental Research within the U.S. Department of Energy’s (DOE) Office of Science under contract no. DE-AC02-05CH11231 and used resources of the National Energy Research Scientific Computing Center, which is supported by the DOE Office of Science under contract no. DE-AC02-05CH11231. Funding for the journal open-access charge was provided by the University of California.

Publications

Markowitz, V. M., et al. “IMG 4 version of the integrated microbial genomes comparative analysis system.” Nucl. Acids Res. 42(D1), D560–D567 (2014). [DOI: 10.1093/nar/gkt963].

Markowitz, V. M., et al. “IMG/M 4 version of the integrated metagenome comparative analysis system.” Nucl. Acids Res. 42(D1), D568–D573 (2014). [DOI: 10.1093/nar/gkt919].

Related Links

IMG Website

IMG/M Website

Highlight Categories

Program: BER, BSSD

Performer: SC User Facilities, BER User Facilities, JGI