Peer-reviewed
articles
Overview
of OMA:
Verification of Stable Pairs:
Estimating confidence intervals in Stable Pair computation:
Summary
The
OMA algorithm uses as input the sequences from complete genomes and
yields orthologous relations and groups of orthologs. The process works
in 4
phases:
-
1.All-against-all,
thepairwise
alignments between every pair of proteins of all genomes
(“all-against-all”). For significant ones, an
evolutionary distance is
estimated in PAM units.
-
2.
Formation of “stable pairs” (SPs),
the mutually closest pairs within a confidence
interval, to avoid exclusion of n:m orthologs.
-
3.Detection
of
paralogs
among SPs using witnesses of non-orthology [2]. Pairs
surviving this step are named “verified pairs”
(VPs). At this point, they are
very likely to be orthologs.
-
4.Formation
of OMA
groups,
made of a subset of proteins in which all pairs are VPs
(clique).
All-against-all
The
sequences used are available from public
databases (mainly GenBank for Prokaryotes and Ensembl for
Eukaryotes). All
data are checked for consistency.
In cases of alternate splicing variants,
we select the longest isoform,
as well as the set of isoforms that have at least 10% non-redundant
positions.
Homology
is established in two sub-steps:
first, full alignments using local dynamic programming (Smith-Waterman)
are
performed between all sequences using the fixed 224 updated PAM matrix
(Gonnet
et al. 1992) to find all homologous sequences. Second, all significant
alignments (score > 85) are refined using the PAM scoring matrix
that
maximizes the similarity score. The PAM unit of that matrix corresponds
to the
maximum likelihood distance estimator.
We
use local alignments because they are more
sensitive to detect homology. However, we have an additional length
requirement
to exclude cases in which only a domain or a fragment match.
The length of the significant alignment must cover least 68% of
the longer sequence.
Formation
of Stable Pairs
In
OMA, we call a stable pair two
sequences xi
and yj
in two genomes X and Y that have no significantly closer homolog than
each other. In the figure below, the arrows illustrate the distance
between
the sequences in two genomes:
The
sequences x1
in X and y1
in Y form a stable pair because there is no sequence with a
significantly closer distance
to either of the two sequences. The same holds for sequences x1
and y2,
but sequences
x2
and y2
do not form a stable pair, because x1
is significantly closer to
y2
than x2.
Note that the use of confidence intervals is necessary to take into
account many-to-many orthologous relations, which arise when
duplications have
taken place after speciation.
Besides, no distance estimation is exact, and
thus it is possible
that true orthologs have not the
shortest estimated distance. Please
refer to the article cited above for more details
on the estimation of the variance and covariance. We
use k=1.96, corresponding to the 95% confidence
interval.
Detection of Paralogs
The verification of each stable pair (x1-y2)
orthology is performed through an exhaustive search in all third-party
genomes
Z for two genes z3
and z4
(“the witnesses of non-orthology”) that
fulfill the following conditions (the derivation and justification of
these
results can be found in [2]):
There,
we also use
k=1.96.
Stable
pairs for which no witness of non-orthology could be found are
called “verified pairs” and are very likely to be
orthologs. The OMA browser
reports all verified pairs involving a sequence in the protein-centric
view.
Formation
of OMA groups
For
some applications, such as phylogenetic tree construction, it may be of
interest to have groups of orthologs. We have the requirement that in
an OMA
group, every protein is orthologous to every other protein in the
group.
Consequently, an OMA group has at most one sequence per genome.
We
construct groups by constructing
the graph of verified stable pairs where vertices are the sequences and
the
edges are the verified stable pairs.
Each edge is weighted by the similarity score between the two
connected sequences. We then identify in this graph maximum weight
cliques.
Knowing
that the clique
problem is NP-complete, we use a heuristic approach.
Summary
by C. Dessimoz, A. Schneider and A. Roth
({cdessimo,schneadr,alexander.roth} at inf.ethz.ch), May 2007.