The OMA database is a publicly available orthology database with a yearly release cycle. As of the latest release, it contains orthology information for 2,851 species: 1,965 bacteria, 173 archaea, and 713 eukaryotes. The OMA Browser database serves as the basis for both OMAmer and OMArk. In this session, we will go through the easiest ways to get information from the Browser and to get the most out of OMArk.
The species content of the current release of the OMA Browser is listed under the 'Explore' > 'Species/release information' section of the website.
1. How many arthropod and mammalian genomes are represented in OMA?
If you are interested in specific clades, you can also search for them using the main search bar on top of the Browser.
2. Look up the page about the Obtectomera clade. How many species of this clade are in OMA?
3. Look up information about the species Papilio machaon. What is the identifier of its assembly? What is its release date?
The following exercises focus on analyzing the evolutionary history of a gene family. For an introduction to the iHam graphical viewer (needed to answer the following questions), see our documentation and YouTube video.
From the OMA Browser, click on the link to the rat P53 gene or search for the protein P53_RAT in the search bar. Then click on the Groups button and select HOG. We display the deepest HOG in which the protein is present - a gene sub-family. You can go to the largest HOG in which this gene is present (known as a “Root HOG” in OMA) by clicking on the first taxon on the taxa list below the header.
4. From which taxon did this gene family originate?
5. In which species is this gene most commonly duplicated?
Duplications cause new HOGs to be created under the Root HOG. When they do, the identifier of each new HOG is that of the top-level HOG with ‘.[number][letter]’ appended.
6. Look for information about the HOG D0637001.1a. It originated from a duplication in Neopterygii. Click on the Neopterygii node on the tree. How does the duplication appear in the visualization?
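The naming rule above can be illustrated with a small sketch that splits an identifier into its Root HOG part and its duplication suffixes. The helper names `split_hog_id` and `parent_hog` are hypothetical and not part of any OMA tool, and the suffix pattern (digits followed by lowercase letters) is an assumption based on the ‘.[number][letter]’ description.

```python
import re

def split_hog_id(hog_id):
    """Split a HOG identifier into (root part, list of duplication suffixes)."""
    root, *suffixes = hog_id.split(".")
    # Assumption: each suffix looks like "<number><letter>", e.g. "1a".
    for suffix in suffixes:
        if not re.fullmatch(r"\d+[a-z]+", suffix):
            raise ValueError(f"unexpected sub-HOG suffix: {suffix!r}")
    return root, suffixes

def parent_hog(hog_id):
    """Return the identifier one duplication level up, or None for a Root HOG."""
    root, suffixes = split_hog_id(hog_id)
    if not suffixes:
        return None
    return ".".join([root, *suffixes[:-1]])

root, suffixes = split_hog_id("D0637001.1a")
print(root, suffixes)
print(parent_hog("D0637001.1a"))
```

Reading an identifier this way makes it easy to see which Root HOG a sub-HOG belongs to and how many duplication levels separate them.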
It is possible to look at member proteins of a HOG and download corresponding sequences through the "Members" tab.
7. How many sequences are in the HOG D0637001.1a? You can download their sequences in the FASTA format.
This session needs GitPod and should be run from
The input FASTA files for OMAmer are in the subfolder Proteomes, and the OMAmer database is in the folder DB.
OMArk is a pipeline that assesses the quality of proteomes, i.e. the protein-coding gene repertoire of species. It is based on comparison with OMA HOGs. As a practical example, we will assess the quality of the proteome of the narwhal, Monodon monoceros.
The first step of OMArk is the placement of genes into HOGs with the OMAmer software. It provides a fast way to find representatives of known gene families in a query proteome. OMAmer compares the k-mers (words of k characters) of each query sequence with those of OMA HOGs to place the sequence into an existing HOG.
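To make the idea of k-mer comparison concrete, here is a toy sketch, not OMAmer's actual algorithm or scoring. It extracts overlapping k-mers from a query and reports the fraction that also occurs in a set of reference sequences. The function names, the sequences, and the value of k are all illustrative assumptions.

```python
def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_fraction(query, reference_seqs, k):
    """Fraction of the query's distinct k-mers found in any reference sequence."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    reference_kmers = set().union(*(kmers(s, k) for s in reference_seqs))
    return len(query_kmers & reference_kmers) / len(query_kmers)

# Invented toy sequences standing in for a query protein and two gene families.
query = "MEEPQSDLS"
family_a = ["MEEPQSDPS"]
family_b = ["MKTAYIAKQ"]

print(shared_fraction(query, family_a, k=3))
print(shared_fraction(query, family_b, k=3))
```

The query shares most of its 3-mers with the first toy family and none with the second, which is the kind of signal a k-mer based placement relies on.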
To accomplish this, a special database is required, which mirrors the content of the OMA database's HOGs and their sequences' k-mers. OMA also offers lightweight, clade-specific databases. However, OMArk works best with a comprehensive database, called LUCA.h5. It is available at https://oma-stage.vital-it.ch.org/All/LUCA.h5 and, for this session, in the DB folder.
An OMAmer database is big (>10GB) and, for now, running the software requires a large amount of RAM. We recommend performing the analysis on a computing cluster rather than on a laptop. As running it on GitPod could take time, we provide pre-computed OMAmer result files for this session. If you accidentally overwrote the ones in the omamer folder of your working space, you can copy those in the folder
1. The command that performs gene placement with OMAmer is omamer search. What command would you write to run OMAmer with the query proteome Monmon.fa?
Run omamer search --help to get documentation about the parameters.
omamer search --db DB/LUCA.h5 --query Proteomes/Monmon.fa --out omamer/Monmon.omamer.txt
2. Take a look at the output file (omamer/Monmon.omamer.txt). The results indicate, for each protein, its best placement into HOGs. What is the placement for the protein A0A4U1FQC7_MONMO? What is this HOG's taxonomic rank in the OMA Browser?
The family and subfamily scores represent the proportion of k-mers of the sequence that overlap with k-mers of the HOG. For subfamilies, which are nested HOGs, the k-mers already shared with the encompassing HOG are excluded.
3. Based on these scores, would you consider the placement of A0A4U1FQC7_MONMO to be of high confidence?
Some proteins are not assigned to any HOGs as the k-mer overlap was not significant. These proteins are marked with "na" indicating their unassigned status.
4. How many proteins are not assigned to any HOG in this file?
grep '\sna\s' <file> | wc -l
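The same count can be sketched in Python, which is convenient if you want to go on to inspect the unassigned rows. This is a minimal sketch under stated assumptions: the file is tab-separated with a header row, the placement column is named `hogid`, and lines starting with "!" are comments. The sample data below is invented, so check these assumptions against your own OMAmer output.

```python
import csv
import io

# Invented toy data standing in for an OMAmer result file.
SAMPLE = (
    "!toy comment line\n"
    "qseqid\thogid\tfamily_score\n"
    "prot1\tHOG:0000001.1a\t0.91\n"
    "prot2\tna\tna\n"
    "prot3\tHOG:0000002\t0.45\n"
    "prot4\tna\tna\n"
)

def count_unassigned(handle, hog_column="hogid", marker="na"):
    """Count rows whose placement column holds the unassigned marker."""
    # Skip comment lines (assumed to start with "!") before parsing the table.
    rows = (line for line in handle if not line.startswith("!"))
    reader = csv.DictReader(rows, delimiter="\t")
    return sum(1 for row in reader if row[hog_column] == marker)

print(count_unassigned(io.StringIO(SAMPLE)))
```

To use it on a real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("omamer/Monmon.omamer.txt")`.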
This session needs GitPod and should be run from
The input files for OMArk are in the subfolders Proteomes and omamer, and the OMAmer database is in the folder DB.
OMArk uses the output from OMAmer and OMA's data on ancestral gene content to assess the quality of a proteome in terms of completeness and consistency. In the final part of the tutorial, we will explore how to execute it and interpret the results.
You can execute OMArk with the omark command on the command line, as the software should be installed in your workspace. If it is not already installed, you can install it by running pip install omark in your environment.
1. What command would you write to run OMArk on the Monmon OMAmer file?
Run omark --help to see possible parameters.
omark -d DB/LUCA.h5 -f omamer/Monmon.omamer.txt -o omark/Monmon_results -v
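If you have several proteomes to process, it can help to generate both command lines from the FASTA path. The sketch below only builds the command strings and does not run anything. It reuses the flags and folder layout shown in this tutorial, and the helper name `build_commands` is hypothetical.

```python
import shlex
from pathlib import Path

def build_commands(proteome, db="DB/LUCA.h5"):
    """Return the (omamer, omark) command lines for one proteome FASTA file."""
    stem = Path(proteome).stem  # "Proteomes/Monmon.fa" -> "Monmon"
    omamer_out = f"omamer/{stem}.omamer.txt"
    omamer_cmd = ["omamer", "search", "--db", db,
                  "--query", proteome, "--out", omamer_out]
    omark_cmd = ["omark", "-d", db, "-f", omamer_out,
                 "-o", f"omark/{stem}_results", "-v"]
    # shlex.join quotes arguments only where the shell would need it.
    return shlex.join(omamer_cmd), shlex.join(omark_cmd)

for command in build_commands("Proteomes/Monmon.fa"):
    print(command)
```

Printing the commands first, rather than executing them directly, lets you review them before submitting them to a cluster.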
OMArk will take around 10 minutes to complete. Some information is already available in the command line output. When no taxonomy information is given by the user, OMArk determines the species composition based on the OMAmer placement.
2. Which ancestral lineage did OMArk select?
OMArk searches for HOGs that contain genes for at least 80% of the ancestral lineage’s descendant species that are present in OMA. These conserved genes are expected to be present in the proteome and serve as a proxy for completeness.
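The 80% rule can be illustrated with a toy selection over made-up data. This is a sketch of the idea only, not OMArk's implementation: the function name, the HOG labels, and the species labels are all hypothetical.

```python
def conserved_hogs(hog_to_species, lineage_species, min_fraction=0.8):
    """Return HOGs with genes in at least min_fraction of the lineage's species."""
    lineage = set(lineage_species)
    selected = []
    for hog, species in hog_to_species.items():
        covered = len(lineage & set(species)) / len(lineage)
        if covered >= min_fraction:
            selected.append(hog)
    return sorted(selected)

# Invented example: five descendant species and three candidate HOGs.
lineage = ["sp1", "sp2", "sp3", "sp4", "sp5"]
hogs = {
    "hogA": ["sp1", "sp2", "sp3", "sp4", "sp5"],  # present in 5 of 5 species
    "hogB": ["sp1", "sp2", "sp3", "sp4"],         # present in 4 of 5 species
    "hogC": ["sp1", "sp2", "sp3"],                # present in 3 of 5 species
}

print(conserved_hogs(hogs, lineage))
```

In this toy case, hogA and hogB pass the threshold while hogC does not, so only the first two would count toward completeness.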
3. In the output of OMArk, open either the file with the extension "_detailed_summary.txt" (human-readable) or ".sum" (machine-readable) to access the completeness assessment. How many HOGs were used for the completeness assessment?
4. How many genes are reported as missing in the proteome?
OMArk reports two categories of duplicates: Expected duplicates occur when conserved ancestral HOGs have known sub-HOGs to which sequences are mapped, while Unexpected duplicates occur when multiple genes are placed into the same HOG at the ancestral level.
5. To which category do the majority of duplicate genes in the proteome belong?
The consistency assessment in OMArk uses OMA information to address several questions: Do we find homologs for the predicted proteins? Are those homologs found in clades consistent with the query taxonomy? Do they exhibit similar protein structures?
6. How taxonomically consistent are the proteins in this proteome?
7. Would you say the gene structures in this proteome are of high quality?
8. How many Unknown proteins are there?
The sequences for each category can be found in the output file with the '.ump' extension. You can search for the identifier of these fragmented genes or partial hit genes in the OMAmer output and refer to the corresponding HOGs to investigate the results.
9. For instance, let's look up the fragmented protein A0A4U1EN85_MONMO in the OMAmer output. What can you conclude about it?
You can use
grep A0A4U1EN85_MONMO omamer/Monmon.omamer.txt
Pay particular attention to the last two columns that indicate the query sequence length and the median sequence length in the subfamily.
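One simple way to reason about those two columns is to compare them as a ratio. The sketch below does this with invented lengths. The 0.5 cutoff is an illustrative assumption for this example and should not be read as the threshold OMArk applies.

```python
def length_ratio(query_length, median_subfamily_length):
    """Ratio of the query length to the median length in its subfamily."""
    return query_length / median_subfamily_length

def looks_fragmented(query_length, median_subfamily_length, cutoff=0.5):
    """Flag a protein that is much shorter than its subfamily's median length."""
    # Assumption: "much shorter" means below `cutoff` times the median.
    return length_ratio(query_length, median_subfamily_length) < cutoff

# Invented lengths, for illustration only.
print(length_ratio(120, 400), looks_fragmented(120, 400))
print(length_ratio(380, 400), looks_fragmented(380, 400))
```

A ratio far below 1 suggests the predicted protein covers only part of the typical gene in that subfamily, which is consistent with a fragmented gene model.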
10. We have another proteome available for this species, obtained from the NCBI. It contains isoforms of the same gene, so we will provide OMArk with a file that specifies which proteins come from the same gene (available in Proteomes/Monmon_NCBI.splice). What is the command to run OMArk on the Monmon_NCBI.omamer.txt file?
omark -f omamer/Monmon_NCBI.1.txt -i Proteomes/Monmon_NCBI.splice -d DB/LUCA.h5 -o omark/Monmon_NCBI_results/ -v
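To see what such an isoform file encodes, here is a minimal parsing sketch. It assumes one gene per line with the protein identifiers of its isoforms separated by semicolons. Please check this against the actual Monmon_NCBI.splice file before relying on it. The identifiers and function names below are invented.

```python
def read_isoform_groups(lines, separator=";"):
    """Parse lines of separator-delimited protein IDs into one list per gene."""
    groups = []
    for line in lines:
        ids = [part.strip() for part in line.strip().split(separator) if part.strip()]
        if ids:  # skip blank lines
            groups.append(ids)
    return groups

def protein_to_gene(groups):
    """Map each protein ID to the index of the gene (line) it belongs to."""
    return {pid: index for index, ids in enumerate(groups) for pid in ids}

# Invented identifiers: gene 0 has three isoforms, gene 1 has one.
toy_lines = ["isoA.1;isoA.2;isoA.3", "isoB.1", ""]
groups = read_isoform_groups(toy_lines)
print(groups)
print(protein_to_gene(groups))
```

Grouping isoforms this way is what allows a tool to count each gene once rather than once per isoform.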
11. What can you say about this proteome in comparison to the previous one?
OMArk can also be used on its website: https://omark.omabrowser.org/. The website contains data about public proteomes from multiple sources, making it easy to compare your results with them. This is particularly relevant because OMArk results can vary between clades and are most useful for relative comparisons.
12. Look for the clade Cetacea on the website. What can you say about the proteomes with outlier proteome size (number of proteins)?
Click on "Select Taxon" and type Cetacea.
13. Is the NCBI Monodon monoceros proteome of good quality for its clade?
14. In this view, open the proteome of Physeter catodon. What is it contaminated with?