The Uncovering Genome Mysteries Project

A joint effort between World Community Grid, sponsored by IBM, and researchers from the Oswaldo Cruz Foundation (Fiocruz), Brazil, and the University of New South Wales, Australia.

The Uncovering Genome Mysteries project aims to identify organisms and to functionally annotate the genes and proteins found across biodiversity, using high-performance computing with support from the World Community Grid. The analysis covers all taxonomic groups, which we divide for practical reasons into prokaryotes (Bacteria and Archaea) and eukaryotes such as Protozoa, Chromista (Algae), Plantae, Fungi and Animalia (including Homo sapiens).

Most of the well-characterized organisms on earth are microorganisms: microscopically small life forms, mostly single-celled, that comprise bacteria, archaea (bacteria-like organisms that often live in more extreme environments) and protozoa (like amoebae and several parasites), but also yeasts and microscopic algae. Members of this diverse group are present in almost all environments on earth: in the air, water, soil and rocks, and even where conditions are very harsh, such as in deserts, undersea sulphuric volcanic vents or the Antarctic ice. They are crucial elements in the maintenance of all ecological systems and interact closely with one another and with other life forms, being present upon and within other living systems such as plants, animals and humans. For example, by an often-cited estimate, there are about 10 times as many microorganisms living in and upon our bodies as we have human cells of our own. Often the interaction is beneficial for both sides, sometimes it is neutral, and sometimes it is pathogenic. Microorganisms are important for human health (e.g. gut bacteria like Escherichia coli, or the probiotic bacterium Lactobacillus, help digest our food), in agriculture (e.g. the nitrogen-fixing root bacterium Rhizobium, or the many types of soil bacteria that degrade organic matter and perform chemical transformations) and in food production (e.g. the baker's yeast Saccharomyces cerevisiae). Microorganisms also represent the great unseen and underappreciated majority of life on our planet. On a global scale they equal or even surpass all plants in terms of mass and diversity, and one teaspoon of fresh water, seawater or soil contains over one million microorganisms.

Microorganisms have the capacity to degrade, convert or produce almost any organic compound. This allows them, for example, to degrade toxins and many types of waste (bioremediation) or to produce drugs (antibiotics, or other products of pharmaceutical biotechnology). Microorganisms are also key to crucial steps in global biogeochemical cycles, such as the removal and sequestration of CO2, the fixation of nitrogen from the atmosphere, and the transformation of decaying organic matter into new nutrients, as in biodegradation. For example, it has only recently been recognized that microbial processes are responsible for the majority of CO2 absorption by the world's oceans. This and other microbial functions are clearly essential to support all higher life forms, such as plants or humans, and to keep our global ecosystem in balance. Without properly functioning microorganisms, the health of our planet would quickly deteriorate and higher organisms, including humans, would cease to exist. Despite their importance for our planet's health, we know little about the diversity and detailed function of microorganisms in the environment. This is partly due to their microscopic size and to the difficulty of isolating and studying microorganisms in the laboratory, as the majority do not grow in Petri dishes. However, technical breakthroughs in the last decade have made it possible to study microorganisms at an unprecedented level of detail by determining and deciphering the DNA sequence of their genetic code (their genomes).

The genome determines the properties of a microorganism, and by reading the genetic code we can better predict and understand microbial function. Our understanding of how soil bacteria transform cellulose, how pathogenic organisms cause disease, how yeasts produce different metabolites depending on their growth medium and conditions, or how bacterial colonies act together and form biofilms on different surfaces is now incomparably more detailed than before the genomics era. Modern DNA sequencing technologies can determine millions or even billions of DNA sequences in short periods of time, at reasonable cost. New technologies are being developed to increase this capacity by several orders of magnitude, which will allow scientists to determine all DNA sequences hidden in the unseen microbial world, as they have already been doing over the last few years for many medically and industrially important unicellular and multicellular organisms, animals, plants and human individuals.

Since the nineties, genome analyses have concentrated on three kinds of organisms: "model organisms" in biology that had already been studied for decades in laboratories (from Escherichia coli to yeast, helminths and mouse), important human, animal and plant pathogens (like Mycobacterium tuberculosis and M. leprae, or crop pathogens), and organisms seen as representative of branches of the "Tree of Life". Relatively recently, scientists realized that there are far more microorganisms in nature, both in number and in variety, than previously thought. We simply did not know this because they did not show up in laboratory cultures. By taking a sample of soil, water, saliva, the surface of a leaf, the gut of a cow, air, or just about anything, preparing the total genetic material present in that sample (called "the metagenome"), determining the genetic codes, and sorting the fragments of data by computer analysis into the genomes of different organisms, scientists started learning about the far more complex living communities around us, and a sudden view of the universe of biological diversity opened up. Since then, many efforts around the world have been invested in generating and analyzing such metagenome data, enriching our knowledge about biodiversity in air, land and sea, from the Arctic to tropical forests. Slowly, a very complex picture of the diversity of living organisms on our planet is emerging, not to mention the studies on the individual variations of pathogen strains, or of humans.

However, interpreting the now huge and exponentially growing amount of sequence data ("Big Data") is a daunting task. DNA sequence information is of course only meaningful and useful if it can be decoded and interpreted by comparing it to other gene sequences of known or unknown function, a process called "genome annotation", and if variations can be mapped. This decoding and annotation process requires vast amounts of computational power, and this is currently a major bottleneck in making sense of the sequence data available.
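
To make the idea of annotation by comparison concrete, here is a minimal sketch in Python. It is not the method used by the project: it assumes a deliberately simple similarity measure (the fraction of short subsequences, or "k-mers", that two sequences share), whereas real annotation pipelines rely on far more sensitive alignment tools. The function names, the cutoff value and the two reference sequences are all made up for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping substrings of length k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared fraction of two k-mer sets (0.0 = nothing shared, 1.0 = identical sets)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def annotate(query, references, k=3, min_score=0.3):
    """Return (label, score) for the most similar reference sequence.

    The label is None when even the best match scores below min_score,
    i.e. the query stays unannotated. min_score is an arbitrary cutoff
    chosen for this illustration.
    """
    query_kmers = kmers(query, k)
    best_label, best_score = None, 0.0
    for label, ref_seq in references.items():
        score = jaccard(query_kmers, kmers(ref_seq, k))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_score:
        return None, best_score
    return best_label, best_score


# Hypothetical toy data: invented sequences, not real genes.
REFERENCES = {
    "hypothetical cellulase": "ATGGCTAGCTAGGCTTACGATCGATCGGA",
    "hypothetical nitrogenase": "ATGCCCGGGTTTAAACCCGGGTTTAAACC",
}

if __name__ == "__main__":
    # A query that differs from the first reference by a single letter.
    label, score = annotate("ATGGCTAGCTAGGCTTACGATCGTTCGGA", REFERENCES)
    print(label, round(score, 2))
```

Even this toy version has to compare every query against every reference, so the work grows with the product of the two collection sizes. With millions of sequences on each side, that growth is one reason the task calls for so much computing power.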

The "Uncovering Genome Mysteries" project aims to harness the computational power of the World Community Grid to give biological meaning to the sequencing data available for microorganisms. This is done both by comparing individual microbial genomes and by analyzing the genetic information of entire microbial communities from the environment (metagenomes). Decoding genomes and metagenomes provides new information on the diversity of microorganisms and on the functional roles they play in the environment. Comparing this information with known functional data from other organisms already studied in greater detail is crucial for the interpretation and annotation of the codes. This functional microbial information will be useful in many ways. Firstly, new gene functions are discovered, and these may encode steps in new biochemical pathways. This greatly expands our knowledge of biochemistry as a whole, and in particular of the way those organisms interact with one another and respond to environmental stimuli (for example, the presence of more or fewer nutrients of different types, a more acidic environment, or pollutant chemicals). Such data are also very useful for the identification or design and production of new antibiotics, drugs against chronic diseases, or new enzymes for industrial applications, such as food processing, chemical synthesis or the production of green plastics or biofuels. Secondly, we will use this information to document the current baseline of microbial diversity, which will allow us to understand how microorganisms change under environmental stress, such as climate change. And finally, we hope that this information will allow us to better understand and model complex microbial systems, with the long-term aim of managing their important function in the world's ecosystem, whether in the environment, in industrial settings, or in human, animal and plant interactions.