OMA standalone is a software package for running the OMA algorithm to infer homology information on your own data. This includes pairwise orthologs, Hierarchical Orthologous Groups (HOGs), and OMA Groups. It takes as input the coding sequences of genomes or transcriptomes in FASTA format. The recommended input type is amino acid sequences, but OMA also supports nucleotide sequences. With amino acid sequences, users can combine their own data with publicly available genomes from the OMA database, including the pre-computed all-against-all sequence comparisons (the first and most computationally intensive step), using the export function on the OMA website.
In this exercise, we will run OMA standalone to obtain gene families and other orthology information for a few bacterial species. We will download four genomes from the OMA browser, then add our own custom genome as an example. For more information on OMA standalone, please see this blog post and the extensive documentation available here.
tar -zxvf AllAll-...
Note: when using your own genomes, or adding a genome to the exported all-against-all data, the name of the proteome file is used as the name of the genome throughout the rest of the analysis.
3. Add the following dummy bacterial genome to your dataset: my_bacterial_genome.fa
Use `grep` to select the header lines (starting with ">"), before using `wc` to count them. Alternatively, you could use `Bio.SeqIO`, or similar, in Python.
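As a rough illustration of the Python alternative, the sketch below counts FASTA records by counting header lines, using only the standard library rather than `Bio.SeqIO`. The function names `count_fasta_sequences` and `count_per_genome` and the `.fa` suffix default are assumptions made for this example; they are not part of OMA standalone.

```python
from pathlib import Path


def count_fasta_sequences(fasta_path):
    """Count FASTA records by counting lines that start with '>'."""
    with open(fasta_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_per_genome(folder, suffix=".fa"):
    """Return {file name: sequence count} for each FASTA file in a folder.

    The suffix default is an assumption; adjust it to match your files.
    """
    return {
        path.name: count_fasta_sequences(path)
        for path in sorted(Path(folder).iterdir())
        if path.is_file() and path.suffix == suffix
    }
```

For example, `count_fasta_sequences("my_bacterial_genome.fa")` should return the same number as the `grep` and `wc` approach, provided every header line starts with ">".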
The `parameters.drw` file is located in the main OMA directory and should be edited by the user. There are many options that can be tweaked, but two deserve particular attention: `SpeciesTree` and `OutgroupSpecies`.
Note: here, we shall not edit the `SpeciesTree` parameter. Instead, we shall let OMA estimate it. For future reference, this estimation should be used with extreme caution and the resulting `EstimatedSpeciesTree.nwk` file should be examined.
5. Edit the `parameters.drw` file and specify the outgroup species to be Magnetococcus marinus.
The OMA algorithm runs in three main steps: 1) quality and consistency checks of the genomes that will be used to run OMA standalone; 2) all-against-all alignments of every protein sequence to all other protein sequences; and 3) orthology inference, in the form of pairwise orthologs, OMA Groups, and Hierarchical Orthologous Groups (HOGs). For more information on these types of orthologs output by OMA, see OMA: A Primer (Zahn-Zabal et al. 2020). The all-against-all step is the most computationally intensive and takes the longest, which is why it is beneficial to export the precomputed all-against-all comparisons for genomes that are already in the OMA browser.
Cache/AllAll
`bin/oma`
Now that the OMA standalone run is complete, the `Output` folder should have been created. Have a look at its contents.
wc -l *
Bacillus anthracis (strain A0248) and Staphylococcus aureus (strain TW20 / 0582); 2009 pairwise orthologs
cat STRZN-BACAA.txt | cut -f 5 | sort | uniq -c
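The same tally can be sketched in Python with `collections.Counter`. This is a minimal sketch, assuming the pairwise orthologs file is tab-separated and that the fifth column holds the value of interest, as the `cut -f 5` command implies. The function name `count_column_values` is hypothetical. Unlike the shell pipeline, it skips blank lines and lines starting with "#".

```python
from collections import Counter


def count_column_values(lines, column=5, sep="\t"):
    """Tally the values found in one column of delimited text.

    `column` is 1-based, mirroring `cut -f`. Blank lines, lines starting
    with '#', and lines with too few fields are skipped (an assumption
    made for this sketch; `cut` itself does not skip them).
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split(sep)
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    return counts
```

To use it on the file above, open `STRZN-BACAA.txt` and pass the file handle as `lines`; calling `most_common()` on the returned counter lists the values from most to least frequent.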
Go to the `HOGFasta` folder and loop through each file to count the number of genes.
ls -1 | grep ".fa" | sed "s/.*/grep -c \">\" &/" | bash | awk '{ total += $1 } END { print total/NR }'
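For readers who prefer Python, the average computed by the one-liner above could be sketched as follows. The function name `mean_sequences_per_file` and the `*.fa` pattern are assumptions for illustration. The sketch counts only lines that start with ">", which is slightly stricter than `grep -c ">"`.

```python
from pathlib import Path


def mean_sequences_per_file(folder, pattern="*.fa"):
    """Return the mean number of FASTA records per matching file in a folder."""
    counts = []
    for path in sorted(Path(folder).glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            # One record per header line (a line starting with '>').
            counts.append(sum(1 for line in handle if line.startswith(">")))
    if not counts:
        raise ValueError(f"no files matching {pattern!r} in {folder!r}")
    return sum(counts) / len(counts)
```

Run from inside the `HOGFasta` folder, `mean_sequences_per_file(".")` gives the average number of genes per file.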
`OrthologousGroups.txt`.