
The Metagenomic Telescope
Balázs Szalkai; Ildikó Scheer; Kinga Nagy; Beáta G. Vértessy; Vince Grolmusz, edited by Sebastian D. Fugmann
PLoS ONE, vol. 9


Next-generation sequencing technologies have led to the discovery of numerous new microbial species in diverse environmental samples. Some of the new species contain genes never encountered before. Some of these genes encode proteins with novel functions, and some encode proteins that perform a well-known function in a novel way. We describe here a tool, named the Metagenomic Telescope, that applies artificial intelligence methods and appears capable of identifying new protein functions even in well-studied model organisms. As a proof-of-principle demonstration of the Metagenomic Telescope, we considered DNA repair enzymes in the present work. First, we identified DNA repair proteins in well-known organisms (i.e., proteins in base excision repair, nucleotide excision repair, mismatch repair and DNA break repair). Next, we applied multiple alignments and built hidden Markov profiles for each protein separately, across well-researched organisms. Then, using public repositories of metagenomes originating from extreme environments, we identified DNA repair genes in the samples. While the phylogenetic classification of the metagenomic samples is typically not available, we hypothesized that some very special DNA repair strategies need to be applied in bacteria and Archaea living in those extreme circumstances. It is difficult to evaluate results obtained from mostly unknown species; therefore, we applied hidden Markov profiling again: for the DNA repair genes identified in the extreme metagenomes, we prepared new hidden Markov profiles (for each gene separately, subsequent to a cluster analysis), and we searched for similarities to those profiles in model organisms. We found well-known DNA repair proteins, numerous proteins with unknown functions, and also proteins with known but different functions in the model organisms.


The vast field of computer science termed artificial intelligence (AI) offers powerful methods for distilling relevant information from large data sets. Metagenomic databases have been increasingly used in recent years to investigate the bacterial composition of samples taken from a variety of environments. Hidden Markov Models [1] provide a useful methodology for analyzing and comparing different genomic data.

A Hidden Markov Model (HMM), applied to protein sequences, is essentially a random amino acid sequence generator with multiple internal states, two of which are distinguished as the START and STOP states. The generator starts in the START state. Until it arrives at the STOP state, it repeats the following two steps:

  1. it outputs a random amino acid, then
  2. it moves to a random new state (typically not drawn from a uniform distribution).

The role of the multiple internal states is that the probability distribution of the output amino acid and the distribution of the new state both depend on the current state. The model is named “hidden” because the internal states cannot be unambiguously determined by observing the output sequence.
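The generator described above can be sketched in a few lines of Python. This is only a toy illustration: the state names S1 and S2 and all probabilities are hypothetical, chosen so that the amino acid A can be emitted from two different states, which is what makes the state path "hidden".

```python
import random

# Hypothetical toy model: every non-STOP state has an emission distribution
# over amino acids and a transition distribution over next states.
EMIT = {
    "START": {"M": 1.0},
    "S1": {"A": 0.7, "G": 0.3},
    "S2": {"A": 0.2, "K": 0.8},
}
TRANS = {
    "START": {"S1": 1.0},
    "S1": {"S1": 0.4, "S2": 0.6},
    "S2": {"S2": 0.2, "STOP": 0.8},
}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(rng):
    """Run the generator once; return (sequence, visited state path)."""
    state, seq, path = "START", [], []
    while state != "STOP":
        path.append(state)
        seq.append(draw(EMIT[state], rng))  # step 1: output an amino acid
        state = draw(TRANS[state], rng)     # step 2: move to a new state
    return "".join(seq), path


rng = random.Random(0)
for _ in range(3):
    print(generate(rng))
```

An observer who sees only the output string cannot always tell whether a given A came from S1 or from S2; only the returned path reveals it.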

HMMs are particularly useful because they can be trained on a set of input sequences to output similar sequences: if we have proteins with related functions, we can build a Hidden Markov Model that generates random amino acid sequences similar to the ones used in training. An even more useful property of HMMs is that, for any amino acid sequence w, the model can efficiently tell us the probability of generating exactly that sequence w as output.
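The probability of generating exactly a given sequence w is usually computed with the forward algorithm, which sums over all state paths without enumerating them. The sketch below assumes a small hypothetical model (states S1 and S2 and all numbers are illustrative) in which every state, including START, emits one amino acid before moving on.

```python
# Hypothetical toy model; the probabilities are illustrative only.
EMIT = {
    "START": {"M": 1.0},
    "S1": {"A": 0.7, "G": 0.3},
    "S2": {"A": 0.2, "K": 0.8},
}
TRANS = {
    "START": {"S1": 1.0},
    "S1": {"S1": 0.4, "S2": 0.6},
    "S2": {"S2": 0.2, "STOP": 0.8},
}


def sequence_probability(w, emit=EMIT, trans=TRANS):
    """Probability that the generator outputs exactly the string w."""
    # alpha[s]: probability of having emitted the prefix read so far
    # and currently being in state s (before s emits).
    alpha = {"START": 1.0}
    stop_mass = 0.0
    for symbol in w:
        nxt = {}
        for state, p in alpha.items():
            p_emit = p * emit[state].get(symbol, 0.0)
            if p_emit == 0.0:
                continue
            for target, p_move in trans[state].items():
                nxt[target] = nxt.get(target, 0.0) + p_emit * p_move
        # Paths that reached STOP have ended; they count only if w ends here.
        stop_mass = nxt.pop("STOP", 0.0)
        alpha = nxt
    return stop_mass


print(sequence_probability("MAK"))  # approximately 0.2688
print(sequence_probability("MA"))   # 0.0: S1 cannot move directly to STOP
```

The running time grows linearly with the length of w, not with the number of possible paths. Practical tools typically work with log-probabilities or log-odds scores to avoid numerical underflow on long sequences.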

Consequently, an HMM trained on a certain set of proteins assigns higher scores (i.e., probabilities) to proteins similar to the training set and lower scores to proteins dissimilar to it. Note that this scoring is usually not homogeneous along the sequence, unlike in the case of BLAST [2] and its clones [3]: in HMMs, conserved subsequences are differentiated from those appearing in variable regions.
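The effect of position-dependent scoring can be illustrated with a deliberately simplified stand-in for a profile HMM: a per-column log-odds table with no insert or delete states. The training alignment, the uniform background frequencies and the pseudocount below are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_log_odds(alignment, pseudocount=1.0):
    """Build one {residue: log-odds} table per column of a gap-free alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    tables = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        tables.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return tables


def score(seq, tables):
    """Sum the column-specific log-odds of each residue in seq."""
    return sum(table[aa] for aa, table in zip(seq, tables))


# Columns 1, 2 and 4 are fully conserved; column 3 is variable.
training = ["ACDK", "ACEK", "ACFK", "ACGK"]
tables = column_log_odds(training)

print(score("ACHK", tables))  # one substitution, in the variable column
print(score("AWDK", tables))  # one substitution, in a conserved column
```

Both query sequences differ from the training set at a single position, yet the substitution in the conserved column is penalized more heavily. A scheme that applies one substitution matrix uniformly along the sequence cannot make this distinction.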

In the present work, we applied HMMs in a novel way to suggest and possibly discover still unknown protein functions in several well-studied model organisms. Starting from sequence alignments of proteins involved in DNA damage repair, we created Hidden Markov Models and used them to search for similar genes in metagenomic samples from different environments. Combining the original HMM with the genes found in the metagenomes, we created a second, more extensively trained HMM, which we used to interrogate the proteomes of higher-order model organisms. This search (termed the “Metagenomic Telescope” in the present study) generated numerous novel hits in the higher-order organisms, including proteins not previously described as closely similar to DNA damage repair proteins. These results indicate that the Metagenomic Telescope may be a powerful method for identifying novel proteins in higher-order model organisms.
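The two-pass search can be outlined as a sequence of command-line calls. The sketch below only assembles and prints the commands; it does not run them. It assumes Clustal Omega (`clustalo`) for the seed alignment and the HMMER programs `hmmbuild` and `hmmsearch` for model building and searching; the use of HMMER, all file names and the function name are assumptions of this example. As a shortcut, the second model is rebuilt from the alignment of first-pass hits saved by `hmmsearch -A`, which does not reproduce the cluster analysis or the exact way the original model and the metagenomic genes were combined in the study.

```python
import shlex
from pathlib import Path


def telescope_commands(seed_fasta, metagenome_fasta, proteome_fasta, workdir="work"):
    """Return the command lists for a two-pass profile search (not executed)."""
    w = Path(workdir)
    seed_aln = str(w / "seed.sto")      # Stockholm alignment of seed proteins
    hmm1 = str(w / "pass1.hmm")         # model trained on the seed alignment
    hits_aln = str(w / "hits.sto")      # alignment of metagenomic hits
    tbl1 = str(w / "pass1.tbl")
    hmm2 = str(w / "pass2.hmm")         # model rebuilt from metagenomic hits
    tbl2 = str(w / "pass2.tbl")
    return [
        # 1. Align the known proteins.
        ["clustalo", "-i", seed_fasta, "-o", seed_aln, "--outfmt=st", "--force"],
        # 2. Train the first profile HMM.
        ["hmmbuild", hmm1, seed_aln],
        # 3. Search the metagenome; -A saves an alignment of significant hits.
        ["hmmsearch", "-A", hits_aln, "--tblout", tbl1, hmm1, metagenome_fasta],
        # 4. Train the second profile HMM on the metagenomic hits.
        ["hmmbuild", hmm2, hits_aln],
        # 5. Point the "telescope" back at a model-organism proteome.
        ["hmmsearch", "--tblout", tbl2, hmm2, proteome_fasta],
    ]


for cmd in telescope_commands("seed.fasta", "metagenome.fasta", "proteome.fasta"):
    print(shlex.join(cmd))
```

Keeping the plan as data makes it easy to inspect or log each step before passing it to `subprocess.run`; E-value thresholds and any clustering of hits would need to be added for a real analysis.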


First, we took some known E. coli and archaeal occurrences of a specific enzyme, as listed in Table 1. We aligned these similar proteins using Clustal Omega [4]. The aligned sequences were then used to train an HMM with the

  1. Reference text not provided in the source (ref name: pone.0101605-Baum1)
  2. Reference text not provided in the source (ref name: pone.0101605-Altschul1)
  3. Reference text not provided in the source (ref name: pone.0101605-Banky1)
  4. Reference text not provided in the source (ref name: pone.0101605-Sievers1)