Module 2: Fast placement of sequences into HOGs with OMAmer

Sometimes you might have a few protein sequences from a genome which is not in the OMA database and you want to quickly find out which genes they share homology with. Or perhaps you even want to do this with a whole proteome.

OMAmer is a command-line software that places a given protein sequence onto one of the gene families available in the input OMA database. In other words, OMAmer finds the most likely HOG where the input protein belongs. OMAmer is based on comparing k-mers (substring of the sequence of k length) between a query sequence and HOGs. Since it only searches for k-mers that are in common between sequences, it does not need a sequence alignment (which is usually computationally intensive) and is a very fast alternative to high-resolution homology determination when one is simply looking for the gene family a sequence belongs to.

Back to home / Reset

2.1 OMAmer setup and requirements

To use OMAmer you need to 1) have the program installed on your machine or environment and 2) have the right OMAmer database.

Make sure OMAmer is installed on your machine. Type omamer search -h on the command line. If your screen shows instructions to run OMAmer, skip to the next point. Otherwise, you can install OMAmer with pip: pip install omamer

An OMAmer database is built from an OMA instance, and stores all the HOGs and their k-mer contents. OMAmer databases can be constructed from the whole OMA Browser database or for specific clades in the OMA database. In this tutorial, we will use clade-specific OMAmer databases.

Make sure you have an OMAmer database file on your machine. If you are using Gitpod, write ls omamer_databases . You should see two files: ‘Metazoa.h5’ and ‘Saccharomyceta.h5’. Otherwise please download these databases on the OMA Browser. When using an OMAmer database, it is a good practice to check and report what parameters were used to create it, and which OMA release was used to create it. Check the information from the ‘Saccharomyceta.h5’ database.

  • 1. How many species are included in the Saccharomyceta database?

    Hint: Use omamer info --db omamer_databases/Saccharomyceta.h5

    155 species.

  • 2. What k-mer size was used to create this OMAmer database?

    6-mers (k-mers of size 6)

  • 3. What is the minimal size of HOGs integrated into this OMAmer database?

    You can check the row “min fam size”.

    A minimum of 6 proteins in a HOG.

2.2 Placing a few sequences into Hierarchical Orthologous Groups

In this module, you are interested in a few protein sequences from the mammoth Mammuthus primigenius and would like to find the corresponding HOGs for further phylogenomic analyses. For this, you will need the input files which contain the protein sequences to be in FASTA format. You will also need to use the OMAmer command line or the OMA Browser.

Use the command line to place the proteins in the file MAMPROT.fa into HOGs. For this, use the command omamer search.

  • 1. What would be the command line for omamer search if you want the results to be output into a file called MAMPROT.hogmap.txt?

    You can find the help by typing omamer search --help

    omamer search --db omamer_databases/Metazoa.h5 --query MAMPROT.fa --out MAPROT.hogmap.txt

Optional You can also do an OMAmer search directly on the OMA Browser! Go to Tools->Fastmapping and upload the MAMPROT.fa file. After a few minutes, you can download a results file similar to the one above.

The family score corresponds to the proportion of k-mers in common with the HOG it was placed into: 1 being all the k-mers and 0 being none.
The overlap is the proportion of the sequence between the first k-mer and last k-mer of the sequences that maps to these HOGs - it gives an estimate of whether the whole sequence or only part of it corresponds to the HOG.

  • 2. How confident can we be in the mammoth sequences’ placement into HOGs?

    The two proteins have both a high score and high overlaps. The placements are high confidence.

  • 3. Given the root HOG placement, what proteins are MP1 and MP2?

    If the HOG description is not informative, take a look at the HOG members.

    • MP1: Hemoglobin alpha
    • MP2: Myoglobin

2.3 Placing a whole proteome

You just sequenced and annotated a whole proteome for a yeast species, and would like to get orthology information for this proteome of interest. You can use OMAmer to place all the sequences onto the HOGs, which takes only a few minutes. OMAmer results are the starting point for high-resolution orthology inference with FastOMA, which is the topic of the next module.

Use OMAmer to place sequences for the NewYeastProteome.fa file into HOGs from the Saccharomyceta.h5 database.

  • 1. What is the command to do this?

    omamer search --db omamer_databases/Saccharomyceta.h5 --query NewYeastProteome.fa --out NewYeastProteome.hogmap.txt

  • 2. How many proteins are there in the proteomes? How many for which no homologs are found?

    Proteins for which no homologs are found are still shown in the result file but all information regarding HOGs have “na” as a value. You can use wc <file> to count the number of lines in a file and grep ‘\sna\s’ <file>| wc to count lines with occurrences of a ‘na’s

    14445 proteins of which 3492 have no found homologs.

The yeast gene CDC12 encodes a septin, a protein first identified and yeast and essential for cell division in many Eukaryotes. CDC12 is part of HOG:D0671290.5i.

  • 3. What is the ortholog of this gene in the new yeast species?

    Look for this HOG identifier in the file with grep.

    CDC12’s ortholog is A7EQW0_SCLS1