Biodiversity Genomics Academia : OMA and OMArk for homology exploration and gene annotation quality control

Welcome to the OMA and OMArk session. To go through this session, please use our GitPod for the command-line steps and any step that needs computing. For the first module, you will need to open the OMA Browser.


1. The OMA Browser and HOGs

The OMA database is a publicly available orthology database with a yearly release cycle. As of the latest release, it contains orthology information for 2,927 species: 1,964 bacteria, 174 archaea, and 789 eukaryotes. The OMA Browser database serves as the basis for both OMAmer and OMArk. In this session, we will go through the easiest way to get information from the Browser and get the most out of OMArk.

For this module, go to the OMA Browser.

Browsing the species content in the OMA Browser:

The species content of the current release of the OMA Browser is listed under the 'Explore' > 'Species/release information' section of the website.

1. How many arthropod and mammalian genomes are represented in OMA?

On the release page, search for the clades in the search bar.

92 arthropods and 78 mammals

If you are interested in specific clades, you can also search for them using the main search bar on top of the Browser.

2. Look up the page about the Obtectomera clade. How many species of this clade are in OMA?

Type Obtectomera then select “Taxon” as search field

There are 7 species, including 5 butterflies.

3. Look up information about the species Papilio machaon. What is the identifier of its assembly? What is its release date?

Click on the species name, then look at the DB release category.

The assembly identifier is GCF_912999745. It comes from the 2022 release of RefSeq.

Browsing HOGs

HOGs, or Hierarchical Orthologous Groups, are representations of gene families. A HOG is a set of genes that descended from a common ancestral gene in a specific ancestral species (i.e., at a specific taxonomic level). HOGs are hierarchical because groups defined at more recent clades are encompassed within larger groups that are defined at older clades, making them nested subfamilies. HOGs are the main data representation in OMA and allow for exploring the evolution of a gene family. They are also the main data used for OMArk's quality assessment.

The following exercises focus on analyzing the evolutionary history of a gene family. For an introduction to the iHam graphical viewer (needed to answer the following questions), see our documentation and YouTube video.

Getting HOGs from genomes

From the OMA Browser, click on the link to the rat P53 gene or search for the protein P53_RAT in the search bar. Then click on the Groups button and select HOG. We display the deepest HOG in which the protein is present - a gene sub-family. You can go to the largest HOG in which this gene is present (known as a “Root HOG” in OMA) by clicking on the first taxon on the taxa list below the header.

4. From which taxon did this gene family originate?

Euteolostomi

5. In which species is this gene most commonly duplicated?

The number of boxes on each row indicates the number of genes in this family per species.

Loxodonta africana

Duplications cause new HOGs to be created under the Root HOG. When they do, the identifier of each new HOG is that of the top-level HOG with ‘.[number][letter]’ appended.
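To make the naming scheme concrete, here is a minimal Python sketch that splits a HOG identifier into its Root HOG and its chain of sub-HOG suffixes. The helper names parse_hog_id and is_nested_in are hypothetical, and the sketch assumes that every suffix is separated by a dot, as in the identifiers used in this tutorial.

```python
def parse_hog_id(hog_id):
    """Split a HOG identifier into (root HOG, list of sub-HOG suffixes)."""
    # Drop the optional "HOG:" prefix that appears in OMAmer output.
    core = hog_id[4:] if hog_id.startswith("HOG:") else hog_id
    root, *suffixes = core.split(".")
    return root, suffixes


def is_nested_in(child, parent):
    """True if `child` is the same HOG as `parent` or one of its sub-HOGs."""
    child_root, child_suffixes = parse_hog_id(child)
    parent_root, parent_suffixes = parse_hog_id(parent)
    return (child_root == parent_root
            and child_suffixes[:len(parent_suffixes)] == parent_suffixes)


# Identifiers taken from this tutorial.
print(parse_hog_id("HOG:E0745618.9b.1b"))
print(is_nested_in("E0753924.1a", "E0753924"))
```

Each suffix records one duplication event, so an identifier with two suffixes sits two duplications below its Root HOG.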

6. Look for information about the HOG E0753924.1a. It originated from a duplication in Neopterygii. Click on the Neopterygii node on the tree. How does the duplication appear in the visualization?

The boxes are separated by a line, which shows that this gene was already present in two copies in the last common ancestor of this clade. Genes from E0753924.1a are shown in green; those from another sub-HOG are shown in grey.

It is possible to look at member proteins of a HOG and download corresponding sequences through the "Members" tab.

7. How many sequences are in the HOG E0753924.1a? You can download their sequences in the FASTA format.

There are 48 protein members of this HOG.

The classification of genes into gene families and subfamilies through HOGs provides information about their evolutionary relationships. By taking all the HOGs at a given taxonomic level, we can also estimate the ancestral gene content for any given clade.

2. Placements with OMAmer

This session needs GitPod and should be run from /workspace/oma-omark/working_dir/
The input FASTA files for OMAmer are in the subfolder Proteomes and the OMAmer database in the folder DB

OMArk is a pipeline that assesses the quality of proteomes, i.e. the protein-coding gene repertoires of species. It is based on comparison with OMA HOGs. As a practical example, we will assess the quality of the proteome of the narwhal, Monodon monoceros.
The first step of OMArk is the placement of genes into HOGs with the OMAmer software. It provides a fast way to find representatives of known gene families in a query proteome. OMAmer compares the k-mers (words of k characters) of each query sequence with those of OMA HOGs to place the sequence into an existing HOG.
To accomplish this, a special database is required, which mirrors the content of the OMA database's HOGs and their sequences' k-mers. OMA also offers lightweight, clade-specific databases. However, OMArk works best with a comprehensive database, called LUCA.h5. It can be downloaded from https://omabrowser.org/All/LUCA.h5 and is already provided in the DB folder.
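The idea of comparing k-mers can be illustrated with a toy Python sketch. It is not OMAmer's implementation: the function names, the toy sequences, the family names and the value of k are all assumptions chosen for illustration, and the real software relies on a precomputed index and a statistical score rather than a plain count.

```python
def kmers(sequence, k=6):
    """Return the set of all overlapping k-mers (words of k characters)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_counts(query, families, k=6):
    """For each family, count the query k-mers that also occur in that family."""
    query_kmers = kmers(query, k)
    return {name: len(query_kmers & family_kmers)
            for name, family_kmers in families.items()}


# Toy "database": two made-up families, each reduced to its set of 3-mers.
families = {
    "famA": kmers("MKTAYIAKQR", k=3),
    "famB": kmers("GGSGGSGGSG", k=3),
}
counts = shared_kmer_counts("KTAYIAK", families, k=3)
best = max(counts, key=counts.get)
print(counts, best)
```

The query shares all of its 3-mers with famA and none with famB, so famA is the candidate placement in this toy setting.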

An OMAmer database is big (>10GB) and, for now, running the software requires a large amount of RAM. We recommend performing the analysis on a computing cluster rather than on a laptop. As running it on GitPod could take time, we provide pre-computed OMAmer result files for this session. If you accidentally overwrite the ones in the omamer folder of your working space, you can copy those from the folder workspace/oma-omark/expected_outputs/omamer/.

1. The command that performs gene placement with OMAmer is omamer search. What command would you write to run OMAmer with the query proteome Monmon.fa?

Use omamer search --help to get documentation about the parameters.

omamer search --db DB/LUCA.h5 --query Proteomes/Monmon.fa --out omamer/Monmon.omamer.txt -t 4

2. Take a look at the output file (omamer/Monmon.omamer.txt). The results indicate, for each protein, its best placement into HOGs. What is the placement for the protein A0A4U1FQC7_MONMO? What is this HOG's taxonomic rank in the OMA Browser?

HOG:E0745618.9b.1b. Its rank is Cetacea.

The family_p score measures how unlikely it is that the k-mer matches occurred by chance: the higher the score, the lower the chance that the matching happened randomly. The score is the negative logarithm of the p-value of having as many k-mers in common with the selected Root HOG under a binomial distribution.
The overlap is the proportion of the query sequence lying between the first and the last k-mer that map to the HOG. It gives an estimate of whether the whole sequence or only part of it corresponds to the HOG.
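As a rough illustration of how such a score behaves, the Python sketch below computes the negative logarithm of a binomial tail probability. Several points are assumptions rather than OMAmer's actual implementation: the p-value is read as the probability of at least the observed number of shared k-mers, the logarithm is taken in base 10, and the numbers of k-mers and the per-k-mer chance probability are invented.

```python
import math


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))


def neg_log_pvalue(n_query_kmers, n_shared, p_chance):
    """Negative log10 of the chance of sharing at least n_shared k-mers."""
    return -math.log10(binomial_tail(n_query_kmers, n_shared, p_chance))


# Made-up example: 100 query k-mers, 1% chance of a random match per k-mer.
weak = neg_log_pvalue(100, 5, 0.01)
strong = neg_log_pvalue(100, 20, 0.01)
print(weak, strong)
```

More shared k-mers give a larger score. Direct summation underflows for very small p-values, so a score in the thousands, like the one discussed in this module, can only be obtained by working in log space.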

3. Based on these scores, would you consider the placement of A0A4U1FQC7_MONMO to be of high confidence?

The family score is high (2237) and the subfamily score is close to 1. Very high confidence.

Some proteins are not assigned to any HOG because their k-mer matches were not significant. These proteins are marked with "N/A", indicating their unassigned status.

4. How many proteins are not assigned to any HOG in this file?

Use grep 'N/A N/A N/A ' <file> | wc -l

555 sequences out of 20,731 have no hit.
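The same count can be obtained in Python, which also makes it easy to look up individual placements. This sketch rests on assumptions about the file layout that you should check against your own output: tab-separated columns, metadata lines starting with "!", and a header row containing columns named qseqid and hogid. The three-column sample below is hypothetical and much narrower than a real OMAmer file.

```python
import csv


def read_omamer(lines):
    """Parse OMAmer-style output lines into a list of dicts keyed by column name."""
    # Assumption: metadata lines start with "!"; the first remaining line is the header.
    data_lines = [ln for ln in lines if ln.strip() and not ln.startswith("!")]
    return list(csv.DictReader(data_lines, delimiter="\t"))


def count_unassigned(rows):
    """Count proteins whose HOG column carries the N/A marker."""
    return sum(1 for row in rows if row["hogid"] == "N/A")


# Hypothetical sample; only the first protein and its HOG come from this tutorial.
sample = [
    "!made-up metadata line\n",
    "qseqid\thogid\tfamily_p\n",
    "A0A4U1FQC7_MONMO\tHOG:E0745618.9b.1b\t2237\n",
    "made_up_protein\tN/A\tN/A\n",
]
rows = read_omamer(sample)
placements = {row["qseqid"]: row["hogid"] for row in rows}
print(count_unassigned(rows), placements["A0A4U1FQC7_MONMO"])
```

To run it on the real file, pass an open file handle instead of the sample list, for example read_omamer(open("omamer/Monmon.omamer.txt")).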

OMAmer placements make it easy to obtain information about, and the sequences of, homologs for every sequence of a proteome. They can be combined with the OMA Browser to start a phylogenomic analysis while skipping time-consuming orthology prediction. They can also serve as input for OMArk quality assessments.

3. Proteome quality assessment with OMArk

This session needs GitPod and should be run from /workspace/oma-omark/working_dir/
The input FASTA files are in the subfolder Proteomes, the OMAmer results in the subfolder omamer, and the OMAmer database in the folder DB

OMArk uses the output from OMAmer and OMA's data on ancestral gene content to assess the quality of a proteome in terms of completeness and consistency. In the final part of the tutorial, we will explore how to execute it and interpret the results.

You can execute OMArk with the omark command using the command line, as the software should be installed in your workspace. If it is not already installed, you can do so by running pip install omark in your environment.

1. What command would you write to run OMArk on the Monmon OMAmer file?

Type omark --help to see possible parameters

omark -d DB/LUCA.h5 -f omamer/Monmon.omamer.txt -o omark/Monmon_results -v

OMArk will take around 10 minutes to complete. Some information is already available in the command line output. When no taxonomy information is given by the user, OMArk determines the species composition based on the OMAmer placements.

2. Which ancestral lineage did OMArk select?

Ancestral lineage: Artiodactyla. The ancestral lineage is the subset of the chosen taxon that has at least 5 species in OMA; it is the one used for the completeness and consistency assessments.

Completeness assessment

OMArk searches for HOGs that contain genes from at least 80% of the ancestral lineage’s descendant species that are present in OMA. These conserved genes are expected to be present in the proteome and serve as a proxy for completeness.
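The 80% rule can be sketched in a few lines of Python. This is only an illustration of the selection criterion described above, not OMArk's code: the function name, the data layout (a mapping from HOG to the species that have a gene in it) and the toy species and HOG names are all hypothetical.

```python
def conserved_hogs(hog_species, lineage_species, min_fraction=0.8):
    """Return the HOGs present in at least min_fraction of the lineage's species."""
    lineage = set(lineage_species)
    selected = []
    for hog, species in hog_species.items():
        # Only species that belong to the ancestral lineage are counted.
        fraction = len(set(species) & lineage) / len(lineage)
        if fraction >= min_fraction:
            selected.append(hog)
    return selected


# Toy lineage of five species and three made-up HOGs.
lineage = ["sp1", "sp2", "sp3", "sp4", "sp5"]
hog_species = {
    "HOG_A": ["sp1", "sp2", "sp3", "sp4", "sp5"],  # 5/5 species
    "HOG_B": ["sp1", "sp2", "sp3", "sp4"],         # 4/5 species, exactly 80%
    "HOG_C": ["sp1", "sp2", "sp3"],                # 3/5 species
}
print(conserved_hogs(hog_species, lineage))
```

HOG_A and HOG_B pass the threshold and would count toward completeness; HOG_C is too sparsely represented to be expected in a new proteome.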

3. In the output of OMArk, open either the file with the extension "_detailed_summary.txt" (human-readable) or ".sum" (machine-readable) to access the completeness assessment. How many HOGs were used for the completeness assessment?

13,044

4. How many genes are reported as missing in the proteome?

931 / 7.14%
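The percentage is the number of missing HOGs divided by the number of conserved HOGs used for the assessment. A short Python check with the two figures reported above (the helper name percentage is just for illustration):

```python
def percentage(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# 931 missing HOGs out of the 13,044 conserved HOGs used for completeness.
missing_pct = percentage(931, 13044)
print(f"Missing: {missing_pct}%")  # prints "Missing: 7.14%"
```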

OMArk reports two categories of duplicates: Expected duplicates occur when conserved ancestral HOGs have known sub-HOGs to which sequences are mapped, while Unexpected duplicates occur when multiple genes are placed into the same HOGs at the ancestral level.

5. To which category do the majority of duplicate genes in the proteome belong?

Most of them: 1,236 of the 1,273 duplicates are reported as unexpected.

You can explore which HOGs are reported in each category by opening the output file with the extension “.omq”. The categories in this file, each marked by a leading ‘>’, are:

Single: Single copies
Lost: Missing
Duplicated: Duplicated unexpected
Underspecific: Single copies - when the placement is into a parent HOG
Overspecific_S: Single copies - when the placement is into a single child HOG
Overspecific_D: Duplicated expected
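If you want to work with these categories programmatically, a small Python sketch like the one below can group the entries of an “.omq” file by category. It assumes a simple layout that you should verify on your own file: each category starts with a line made of ‘>’ followed by the category name, and every following line is one entry. The sample content is made up.

```python
# Meaning of each category, as listed above.
OMQ_CATEGORIES = {
    "Single": "Single copies",
    "Lost": "Missing",
    "Duplicated": "Duplicated unexpected",
    "Underspecific": "Single copies (placement into a parent HOG)",
    "Overspecific_S": "Single copies (placement into a single child HOG)",
    "Overspecific_D": "Duplicated expected",
}


def read_omq(lines):
    """Group the entries of an .omq-style file under their '>' category header."""
    groups = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append(line)
    return groups


# Made-up sample with hypothetical entries.
sample_omq = [">Single\n", "entry_1\n", ">Lost\n", "entry_2\n", "entry_3\n"]
groups = read_omq(sample_omq)
for name, entries in groups.items():
    print(name, OMQ_CATEGORIES.get(name, "unknown category"), len(entries))
```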

Consistency assessment

The consistency assessment in OMArk uses OMA information to address several questions: Do we find homologs for the predicted proteins? Are those homologs found in clades consistent with the query taxonomy? Do they exhibit similar protein structures?

6. How taxonomically consistent are the proteins in this proteome?

Most of the proteins (94.37%) are taxonomically consistent. We note, however, some taxonomically inconsistent ones (possible undetected contaminants or genes hitherto unknown in the clade) and some unknown ones (genes with no known homologs or potential errors).

7. Would you say the gene structures in this proteome are of high quality?

Structurally inconsistent genes are marked as Partial hits (less than 80% of the total sequence length was used for the placement) or fragments (they have a length less than half of the median sequence length for the clade)

Gene structures appear to be of medium quality, with around 10.68% partial placements and 13.51% fragments.

8. How many Unknown proteins are there?

555. It is the same number as the proteins marked N/A in the OMAmer output.

The sequences for each category can be found in the output file with the '.ump' extension. You can search for the identifier of these fragmented genes or partial hit genes in the OMAmer output and refer to the corresponding HOGs to investigate the results.

9. For instance, let's look up the fragmented protein A0A4U1EN85_MONMO in the OMAmer output. What can you conclude about it?

You can use grep A0A4U1EN85_MONMO omamer/Monmon.omamer.txt
Pay particular attention to the last two columns that indicate the query sequence length and the median sequence length in the subfamily.

The protein is placed into a HOG with a median protein length of 804, but it is only 344 amino acids long.
By searching for the HOG identifier in the OMAmer file, we can observe that another gene is also assigned to this same HOG (HOG:E0726858), with a sequence length of 499. This HOG is also reported as duplicated in OMArk's results. This “duplication” is likely caused by a fragmented gene sequence.
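The two structural criteria described in this section (a fragment is shorter than half the median sequence length, a partial hit has less than 80% of its length used for the placement) can be checked with a short Python sketch. The function name is hypothetical and the overlap value of 0.9 is invented, since the tutorial does not report it; only the lengths 344, 499 and 804 come from the example above.

```python
def structure_flags(query_length, median_length, overlap):
    """Return (is_fragment, is_partial) for one placed protein."""
    # Fragment: shorter than half of the median sequence length.
    is_fragment = query_length < 0.5 * median_length
    # Partial hit: less than 80% of the sequence was used for the placement.
    is_partial = overlap < 0.8
    return is_fragment, is_partial


# A0A4U1EN85_MONMO: 344 amino acids against a median of 804 (half is 402).
print(structure_flags(344, 804, overlap=0.9))
# The second gene in the same HOG: 499 amino acids, above the 402 cut-off.
print(structure_flags(499, 804, overlap=0.9))
```

The 344-residue protein falls below the cut-off and is flagged as a fragment, whereas the 499-residue one is not, which fits the interpretation of one gene model split into pieces.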

10. We have another proteome available for this species, obtained from the NCBI. It contains several isoforms per gene, so we provide OMArk with a file (Proteomes/Monmon_NCBI.splice) that specifies which proteins come from the same gene. What is the command to run OMArk on the Monmon_NCBI.omamer.txt file?

omark -f omamer/Monmon_NCBI.omamer.txt -i Proteomes/Monmon_NCBI.splice -d DB/LUCA.h5 -o omark/Monmon_NCBI_results/ -v

11. What can you say about this proteome in comparison to the previous one?

It is both more complete and more consistent in terms of gene content.
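When several proteomes need to be assessed, the OMArk command lines used in this module can be assembled from Python. The sketch below only builds the argument list, using the options that appear in this tutorial (-f, -d, -o, -i and -v); the helper name omark_command is hypothetical.

```python
import shlex


def omark_command(omamer_file, database, output_dir, isoform_file=None, verbose=True):
    """Build the argument list for an OMArk run."""
    cmd = ["omark", "-f", omamer_file, "-d", database, "-o", output_dir]
    if isoform_file is not None:
        # File listing which proteins are isoforms of the same gene.
        cmd += ["-i", isoform_file]
    if verbose:
        cmd.append("-v")
    return cmd


cmd = omark_command(
    "omamer/Monmon_NCBI.omamer.txt",
    "DB/LUCA.h5",
    "omark/Monmon_NCBI_results/",
    isoform_file="Proteomes/Monmon_NCBI.splice",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the working directory to launch the analysis.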

OMArk website

OMArk can also be used on its website: https://omark.omabrowser.org/. The website contains data about public proteomes from multiple sources, making it easy to compare your results with public data. This is particularly relevant because OMArk results can vary between clades and are most useful for relative comparisons.

12. Look for the clade Cetacea on the website. What can you say about the proteomes with outlier proteome size (number of proteins)?

Click on select Taxon and type Cetacea

Proteomes with a low gene count are reported as incomplete by OMArk; the one proteome with a higher gene count has more fragments (Monodon monoceros from UniProt).

13. Is the NCBI Monodon monoceros proteome of good quality for its clade?

Yes, its proportions of unknown and missing genes are close to what is observed in other proteomes. (Note that getting 0% may be unrealistic.)

14. In this view, open the proteome of Physeter catodon from Ensembl. What is it contaminated with?

A parasitic apicomplexan from the Sarcocystidae clade.

Finally, upload a proteome of your choice (perhaps downloaded from UniProt) on the OMArk website. You can easily obtain the same results file as shown before by clicking the Download button. You can also easily compare your proteome to closely related ones by clicking on the “Change to comparison view” button.

Category	Prefixes
Genes	id, go, ec, description, domain, sequence
HOGs	hog, sequence
OMA Groups	omagroup, fingerprint, sequence
Taxon	species, taxid, taxon