Module 2: Fast placement of sequences into HOGs with OMAmer

Sometimes you have a few protein sequences from a genome that is not in the OMA database, and you want to quickly find out which genes they are homologous to. Or perhaps you even want to do this for a whole proteome.

OMAmer is a command-line tool that places a given protein sequence into one of the gene families available in the input OMAmer database. In other words, OMAmer finds the most likely HOG to which the input protein belongs. OMAmer works by comparing k-mers (substrings of length k) between a query sequence and the HOGs. Since it only searches for k-mers shared between sequences, it does not need a sequence alignment (which is usually computationally intensive). This makes it a very fast alternative to high-resolution homology inference when one simply wants to know which gene family a sequence belongs to.
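To make the idea of alignment-free k-mer comparison concrete, here is a minimal sketch. It is not OMAmer's actual algorithm or data structure: the function names, the toy sequences, and the "most shared k-mers wins" rule are all assumptions made for illustration.

```python
def kmers(sequence, k=6):
    """Return the set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def best_family(query, families, k=6):
    """Return (family_name, shared_kmer_count) for the family sharing the most k-mers.

    `families` maps a family name to a list of member sequences.
    This is a naive stand-in for placement, not what OMAmer really does.
    """
    query_kmers = kmers(query, k)
    counts = {}
    for name, members in families.items():
        # Pool the k-mers of all members of the family.
        family_kmers = set().union(*(kmers(seq, k) for seq in members))
        counts[name] = len(query_kmers & family_kmers)
    best = max(counts, key=counts.get)
    return best, counts[best]


# Toy sequences invented for this example only.
families = {
    "family_A": ["MKTAYIAKQRQISFVK", "MKTAYIAKQRELSFVK"],
    "family_B": ["GSHMLEDPVDAFKQWL"],
}
query = "AYIAKQRQISF"
print(best_family(query, families))  # ('family_A', 6)
```

No alignment is computed anywhere: the comparison reduces to set intersections, which is why this kind of approach scales to whole proteomes.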


2.1 OMAmer setup and requirements

To use OMAmer you need to 1) have the program installed on your machine or environment and 2) have the right OMAmer database.

For this exercise you can use our GitPod instance. The files needed for the exercises can be found at /workspace/SIBBiodiversityBioinformatics2024/Module2_OMAmer/working_dir

Make sure OMAmer is installed on your machine. Type omamer search -h on the command line. If the usage instructions for OMAmer appear, skip to the next point. Otherwise, you can install OMAmer with pip: pip install omamer

An OMAmer database is built from an OMA instance, and stores all the HOGs and their k-mer contents. OMAmer databases can be constructed from the whole OMA Browser database or for specific clades in the OMA database. In this tutorial, we will use clade-specific OMAmer databases.

Make sure you have an OMAmer database file on your machine. If you are using GitPod, go to the corresponding working directory and type ls omamer_databases. You should see two files: ‘Metazoa.h5’ and ‘Saccharomyceta.h5’. Otherwise, please download these databases from the OMA Browser. When using an OMAmer database, it is good practice to check and report which parameters and which OMA release were used to create it. Check this information for the ‘Saccharomyceta.h5’ database.

1. How many species are included in the Saccharomyceta database?

Hint: Use omamer info --db omamer_databases/Saccharomyceta.h5

155 species.

2. What k-mer size was used to create this OMAmer database?

6-mers (k-mers of size 6)

3. What is the minimal size of HOGs integrated into this OMAmer database?

You can check the row “min fam size”.

A minimum of 6 proteins in a HOG.

2.2 Placing a few sequences into Hierarchical Orthologous Groups

In this section, you are interested in a few protein sequences from the woolly mammoth, Mammuthus primigenius, and would like to find the corresponding HOGs for further phylogenomic analyses. The input file containing the protein sequences must be in FASTA format. You can run the search either with the OMAmer command line or on the OMA Browser.

Use the command line to place the proteins in the file MAMPROT.fa into HOGs. For this, use the command omamer search.

1. What would be the command line for omamer search if you want the results to be output into a file called MAMPROT.hogmap.txt?

You can find the help by typing omamer search --help

omamer search --db omamer_databases/Metazoa.h5 --query MAMPROT.fa --out MAMPROT.hogmap.txt
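If you need to run the same search for several FASTA files, it can help to assemble the command programmatically. The sketch below only builds the argument list from the three options used above (--db, --query, --out); the helper name build_search_command is hypothetical, and the actual call is left commented out because it requires OMAmer to be installed.

```python
import shlex


def build_search_command(db, query, out):
    """Return the argument list for an `omamer search` call (hypothetical helper)."""
    return ["omamer", "search", "--db", db, "--query", query, "--out", out]


cmd = build_search_command(
    "omamer_databases/Metazoa.h5", "MAMPROT.fa", "MAMPROT.hogmap.txt"
)
print(shlex.join(cmd))

# To actually run the search (requires OMAmer to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than as one string avoids quoting problems when file names contain spaces.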

Optional: You can also run an OMAmer search directly on the OMA Browser! Go to Tools->Fastmapping and upload the MAMPROT.fa file. After a few minutes, you can download a results file similar to the one produced on the command line.

The family_p score measures how unlikely it is that the observed k-mer matches arose by chance: the higher the score, the lower the chance that the matches are random. It is the negative logarithm of the p-value of having as many k-mers in common with the selected root HOG under a binomial distribution.

The overlap is the proportion of the query sequence lying between the first and the last k-mer that map to the HOG. It indicates whether the whole sequence or only part of it corresponds to the HOG.
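The two quantities above can be illustrated with a small numerical sketch. This is not OMAmer's implementation. It assumes the p-value is the upper tail of a binomial distribution (at least n_shared matches out of n_kmers query k-mers, each matching by chance with probability p_match), uses the natural logarithm since the text does not specify a base, and takes the overlap as the span from the start of the first matching k-mer to the end of the last one divided by the sequence length. All names and numbers are invented for illustration.

```python
import math


def neg_log_binomial_tail(n_kmers, n_shared, p_match):
    """Return -ln P(X >= n_shared) for X ~ Binomial(n_kmers, p_match)."""
    tail = sum(
        math.comb(n_kmers, i) * p_match**i * (1 - p_match) ** (n_kmers - i)
        for i in range(n_shared, n_kmers + 1)
    )
    # Note: for extremely small tails this can underflow to 0; fine for a toy example.
    return -math.log(tail)


def overlap(first_match_start, last_match_start, k, seq_len):
    """Fraction of the sequence spanned from the first to the last matching k-mer."""
    return (last_match_start + k - first_match_start) / seq_len


# Toy numbers: 100 query k-mers, 1% chance of a random match per k-mer.
for shared in (2, 10, 40):
    print(shared, neg_log_binomial_tail(100, shared, 0.01))

# Matching k-mers start at positions 10 and 44 of a 100-residue sequence, k = 6.
print(overlap(10, 44, 6, 100))
```

The score grows as more k-mers are shared, while the overlap stays low if the matches cluster in a small part of the query, which is why the two values are best read together.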

2. How confident can we be in the mammoth sequences’ placement into HOGs?

Both proteins have a high family_p score and a high overlap, so the placements are high-confidence.

3. Given the root HOG placement, what proteins are MP1 and MP2?

Try searching for the HOG ID in the OMA Browser. If the HOG description is not informative enough, take a look at the HOG members from well-annotated organisms (e.g. human).

MP1: Hemoglobin alpha

MP2: Myoglobin

2.3 Placing a whole proteome

You just sequenced and annotated a whole proteome for a yeast species, and would like to get orthology information for this proteome of interest. You can use OMAmer to place all the sequences onto the HOGs, which takes only a few minutes. OMAmer results are the starting point for high-resolution orthology inference with FastOMA, which is the topic of the next module.

Use OMAmer to place the sequences in the NewYeastProteome.fa file into HOGs from the Saccharomyceta.h5 database.

1. What is the command to do this?

omamer search --db omamer_databases/Saccharomyceta.h5 --query NewYeastProteome.fa --out NewYeastProteome.hogmap.txt

2. How many proteins are there in the proteome? For how many of them are no homologs found?

Proteins for which no homologs are found are still listed in the result file, but all their HOG-related fields are set to 'N/A', starting with the second column. You can use wc -l NewYeastProteome.hogmap.txt to count the number of lines in the file and cut -f 2 NewYeastProteome.hogmap.txt | grep 'N/A' | wc -l to count the lines with 'N/A' in the second column. Do not forget to subtract the 8 header lines from the total count!

14445 proteins, of which 4745 have no homologs found.
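The same count can be sketched in Python. The function below assumes, as in the hint above, a tab-separated file whose first 8 lines are header lines and whose second column holds the HOG ID or 'N/A'. The function name and the toy lines (with only 2 header lines) are invented for illustration and do not reproduce real OMAmer output.

```python
from itertools import islice


def count_placements(lines, n_header_lines=8, hog_column=1):
    """Return (total_proteins, unplaced_proteins) from hogmap lines.

    `lines` can be an open file handle or any iterable of strings.
    """
    total = 0
    unplaced = 0
    for line in islice(lines, n_header_lines, None):
        if not line.strip():
            continue  # ignore empty lines
        fields = line.rstrip("\n").split("\t")
        total += 1
        if fields[hog_column] == "N/A":
            unplaced += 1
    return total, unplaced


# Toy input invented for this example (2 header lines instead of 8).
toy_lines = [
    "!toy header line\n",
    "query_id\thog_id\tscore\n",
    "query_1\tHOG:X0000001\t120.5\n",
    "query_2\tN/A\tN/A\n",
    "query_3\tHOG:X0000002.1a\t88.0\n",
]
print(count_placements(toy_lines, n_header_lines=2))  # (3, 1)

# On the real file, something like:
# with open("NewYeastProteome.hogmap.txt") as handle:
#     print(count_placements(handle))
```

Comparing the second field exactly, rather than searching the whole line for 'N/A', avoids counting lines where only some other column is 'N/A'.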

The yeast gene CDC12 encodes a septin, a protein first identified in yeast and essential for cell division in many eukaryotes. CDC12 is part of HOG:D0671290.5i.

3. What is the ortholog of this gene in the new yeast species?

Look for this HOG identifier in the file with grep.

CDC12’s ortholog is A7EQW0_SCLS1.
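As an alternative to grep, the lookup can be sketched in Python. The function below assumes a tab-separated file with the query ID in the first column and the HOG ID in the second; the first-column assumption, the function name, and the toy identifiers are not taken from the tutorial. Unlike a plain substring search, it matches the given HOG ID exactly or any ID nested below it (the ID followed by a dot).

```python
def queries_in_hog(lines, hog_id, query_column=0, hog_column=1):
    """Return the query IDs placed in `hog_id` or in one of its sub-HOGs."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(query_column, hog_column):
            continue  # skip header or malformed lines with too few columns
        placed = fields[hog_column]
        if placed == hog_id or placed.startswith(hog_id + "."):
            hits.append(fields[query_column])
    return hits


# Toy lines with invented identifiers.
toy_lines = [
    "!toy header line\n",
    "query_1\tHOG:X0000001.2b\t120.5\n",
    "query_2\tHOG:X0000001.2b.7c\t95.1\n",
    "query_3\tHOG:X0000001.2bb\t60.0\n",
    "query_4\tN/A\tN/A\n",
]
print(queries_in_hog(toy_lines, "HOG:X0000001.2b"))  # ['query_1', 'query_2']
```

The dot boundary keeps an ID such as HOG:X0000001.2bb from being reported as a match, which a bare substring search would do.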

Category	Prefixes
Genes	id, go, ec, description, domain, sequence
HOGs	hog, sequence
OMA Groups	omagroup, fingerprint, sequence
Taxon	species, taxid, taxon