Orthology Prediction - Algorithm
 
 
Peer-reviewed articles
 
Overview of OMA:
[1] Christophe Dessimoz, Gina Cannarozzi, Manuel Gil, Daniel Margadant, Alexander Roth, Adrian Schneider, and Gaston H. Gonnet, OMA, a Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements A. McLysaght et al. (Eds): RECOMB 2005 Workshop on Comparative Genomics, LNBI 3687, pp. 61-72,. [PDF] [Link]
 
Verification of Stable Pairs:
[2] Christophe Dessimoz, Brigitte Boeckmann, Alexander Roth, Gaston H. Gonnet, Detecting Non-Orthology in the COGs Database and Other Approaches Grouping Orthologs Using Genome-Specific Best Hits, Nucleic Acids Res, 2006 Jul 11;34(11):3309-3316 [Open Access]
 
Estimating confidence intervals in Stable Pair computation:
[3] Christophe Dessimoz, Manuel Gil, Adrian Schneider, Gaston H. Gonnet, Fast Estimation of the Difference between two PAM/JTT Evolutionary Distances in Triplets of Homologous Sequences , BMC Bioinformatics 2006, 7:529 [Open Access]
 
Summary

The OMA algorithm uses as input the sequences from complete genomes and yields orthologous relations and groups of orthologs. The process works in 4 phases:
  1. 1.All-against-all, thepairwise alignments between every pair of proteins of all genomes (“all-against-all”). For significant ones, an evolutionary distance is estimated in PAM units.
  2. 2. Formation of “stable pairs” (SPs), the mutually closest  pairs within a confidence interval, to avoid exclusion of n:m orthologs.
  3. 3.Detection of paralogs among SPs using witnesses of non-orthology [2]. Pairs surviving this step are named “verified pairs” (VPs). At this point, they are very likely to be orthologs.
  4. 4.Formation of OMA groups, made of a subset of proteins in which all pairs are VPs (clique).
All-against-all
The sequences used are available from public databases (mainly GenBank  for Prokaryotes and Ensembl for Eukaryotes). All data are checked for consistency. In cases of alternate splicing variants, we select the longest isoform, as well as the set of isoforms that have at least 10% non-redundant positions.
 
Homology is established in two sub-steps: first, full alignments using local dynamic programming (Smith-Waterman) are performed between all sequences using the fixed 224 updated PAM matrix (Gonnet et al. 1992) to find all homologous sequences. Second, all significant alignments (score > 85) are refined using the PAM scoring matrix that maximizes the similarity score. The PAM unit of that matrix corresponds to the maximum likelihood distance estimator.
 
We use local alignments because they are more sensitive to detect homology. However, we have an additional length requirement to exclude cases  in which only a domain or a fragment match. The length of the significant alignment must cover least 68% of the longer sequence.
 
Formation of Stable Pairs
In OMA, we call a stable pair two sequences xi and yj in two genomes X and Y that have no significantly closer homolog than each other. In the figure below, the arrows illustrate the distance between the sequences in two genomes:
The sequences x1 in X and y1 in Y form a stable pair because there is no sequence with a significantly closer distance to either of the two sequences. The same holds for sequences x1 and y2, but sequences x2 and y2 do not form a stable pair, because x1 is significantly closer to y2 than x2. Note that the use of confidence intervals is necessary to take into account many-to-many orthologous relations, which arise when duplications have taken place after speciation. Besides, no distance estimation is exact, and thus it is possible that true orthologs have not the shortest estimated distance. Please refer to the article cited above for more details on the estimation of the variance and covariance. We use k=1.96, corresponding to the 95% confidence interval.
 
Detection of Paralogs
The verification of each stable pair (x1-y2) orthology is performed through an exhaustive search in all third-party genomes Z for two genes z3 and z4 (“the witnesses of non-orthology”) that fulfill the following conditions (the derivation and justification of these results can be found in [2]):
 
 
There, we also use k=1.96.
Stable pairs for which no witness of non-orthology could be found are called “verified pairs” and are very likely to be orthologs. The OMA browser reports all verified pairs involving a sequence in the protein-centric view.
 
Formation of OMA groups
For some applications, such as phylogenetic tree construction, it may be of interest to have groups of orthologs. We have the requirement that in an OMA group, every protein is orthologous to every other protein in the group. Consequently, an OMA group has at most one sequence per genome.
We construct groups by constructing the graph of verified stable pairs where vertices are the sequences and the edges are the verified stable pairs. Each edge is weighted by the similarity score between the two connected sequences. We then identify in this graph maximum weight cliques. Knowing that the clique problem is NP-complete, we use a heuristic approach.
 
Summary by C. Dessimoz, A. Schneider and A. Roth ({cdessimo,schneadr,alexander.roth} at inf.ethz.ch), May 2007.