Module: OMAnnotator - Consensus annotation using evolutionary data

For this workshop, we are going to make a consensus annotation for the genome of Drosophila melanogaster. We already made multiple annotation using different process:

  • an ab initio annotation with Augustus (abinitio.gff3)
  • an annotation using Transcriptomic data made with StringTie (rna.gff3)
  • an annotation using homology made with Gemoma (homology.gff3)

With OMAnnotator, we wish to combine these annotations to extract annotations with support from multiple sources, in particular closely related species. For this we use the OMA orthology inference framework.

Setup

To start this workshop, we need: - To install the OMAnnotator software. For this head to https://github.com/DessimozLab/OMAnnotator/ and follow the installation instructions - To download the input data described above, which are available on Zenodo. For this, you can use :
wget https://zenodo.org/records/15322427/files/InputData.tar.gz
tar -x vf InputData.tar.gz
wget https://zenodo.org/records/15322427/files/OMA_Output.tar.gz
tar -xvf InputData.tar.gz

Back to home / Reset

Part 1: Preparing the data

1.1 Selecting species for evolutonary information

The main benefit of OMAnnotator is the possibility to use precomputed evolutionary information from other species to inform the consensus. The more species are selected, the more information but the slower the computation. Let's select six species to include in the run. We will export them from the OMA Browser webiste: www.omabrowser.org.

  • 1. What are the Family and Order of Drosophila melanogaster

    Order: Diptera. Family: Drosophilidae

  • 2. Research the taxonomy division and identify six close species to add.

    Type the taxon name into the search bar and select "Taxon" as a field. When the taxon is not present, adjacent ones are provided.

    You can chose any six species, we suggest:

    • Drosophila pseudoobscura pseudoobscura
    • Drosophila simulans
    • Lucilia cuprina
    • Aedes aegypti
    • Megaselia scalaris
    • Hermetia illucens

Now, we can download the All-Against-All sequence comparison of those species. This is precomputed data from OMA that will speed up the process.

  • 3. Go to the OMA Browser and export the AllAll for the selected species

    Download > Export All/All > Select on the species tree

The Downloaded archive is a copy of a full OMA Standalone directory. The main folders you need to know are:

  • DB - Contains the source proteome to be used in OMA.
  • Cache - This is where the main computations are done, including All against All.

(rna,homology,abinitio)))),HERIL));';

1.3 OMAnnotator prepate_data module

Now that we have the archive, we can use OMAnnotator to process our annotation and add them to the folder.
This is all done with OMAnnotator prepare_data module. But a good practice is to first take a look at the data.

  • 4. Open the source annotation (GFF) files, and notice how the genes, mRNA, and CDS are described.

    For abinitio:
    gene - mRNA - CDS
    For homology:
    gene - mRNA - CDS
    For transcript:
    gene - transcript - CDS

Depending on the source of the annotation, the descriptor of each element may change.
This may sometimes confuse OMAnnotator which need to understand which element is which.You can provide a feature type file to OMAnnotator.
It needs to be a tab separated file with, in orde:

filename<TAB>gene feature<TAB> transcript feature<TAB>CDS feature

  • 5. Write the feature type file for your input files

    abinitio.gff3\tgene\tmRNA\tCDS homology.gff3\tgene\tmRNA\tCDS transcript.gff3\tgene\ttranscript\tCDS

  • 6. Take a look at the RNA file. How many genes are there in it? How many transcripts? How do you explain this?

    You can use grep -c 10,202 genes.
    14,230 transcripts.
    There are alternative transcritps in the annotation.

  • 7. Run the prepare_data module of OMAnnotator.

    Use python OMAnnotator/OMAnnotator.py prepare_data -h to find out which parameters to use.

  • 8. Check the FASTA folder. How many sequences are there in the rna.file?

    grep -c '^>' FASTA/rna.fa

    There are 14,230 sequences in the file.

  • 9. OMAnnotator also generates so-called Splice files which map gene and alternative transcripts for OMA usage. Does the content of this file fit what you know?

    10,201 lines (genes) - 14230 words (transcript)
    It fits what we know about this annotation.

1.2 Prepare the species tree

You need to set the species tree in the OMA Standalone directory to include your new annotation. For this, you need to add a branch containing your annotation file's prefix at the location in the tree where your species of interest should be.
The added part shoud look like this:
(rna,homology,abinitio)

  • 10. Add the annotation to the SpeciesTree in parameters.drw file, at the node corresponding to the D.melanogaster species.

    The node from which it originates should be the same node from the which other species from the same genus originate.

    SpeciesTree := '(AEDAE,((LUCCU,(BACDO,(DROSI,DROPS,

Part 2: Running OMA Standalone

The second part of the pipeline is a 'classic' run of the OMA Standalone pipeline. OMA is a pipeline for orthology inference. For more information on OMA standalone, please see this blog post and the extensive documentation available here.

Execution can be made in three steps:

  1. The database creation step. To run with: ./bin/oma -c
    (Expected execution time: 2min)

  2. The all-against-all computation. In our case, this part would still take hours if not parallelized. Parallelization is straightforward when using a slurm environment. The normal command is: ././bin/oma -s

  3. The Hierarchical Orthologous Groups inference step. This part wrap up the OMA inference and gives the output needed by OMAnnotator. The command is: ./bin/oma
    (Expected execution time: 30min)

For this tutorial, try running the first few commands to experiment the execution. Then, we provide precomputed results downloadable on Zenodo (see above in the Setup section.)

The most important output of OMA for OMAnnotator is the HierarchicalGroups.orthoxml file in the Output folder.

  • 1. Take a look at the OrthoXML file in terminal.

  • 2. Open the phylostrat.html file. How many genes are there at the node from which all the input annotation are connected?

    The number of genes in the 'ancestor' of the node is on the left of the branch.

    13,224 genes.

  • 3. What are the main events reported in branch leading to each annotation?

    There are many losses on each branch, which may mean each annotation method is lacking some of the genes in the true gene set.

Part 3: Extracting the consensus

To obtain the OMAnnotator results, there is only need to extract the relevant information from the OMA output.
We can do this with the extract_consensus module.

  • 1. Use OMAnnotation to extract a consensus with the relevant parameters?

    You can identify the relevant parameters with python OMAnnotator/OMAnnotator.py prepare_data -h0

    python OMAnnotator/OMAnnotator.py extract_consensus -a GFF/ -f FASTA/ -x OMA.2.6.0/Output/HierarchicalGroups.orthoxml -st OMA_After/Output/ManualSpeciesTree.nwk -t featuretype.tsv -o Consensus/consensus

  • 2. Look at the generated files, how many genes were extracted?

    13,224 sequences in the FASTA and genes in the GFF.

  • 3. According to the report, what is the most common source in the annotation?

    The ab initio annotation is the one supporting msot of the genes with more than 10,000 supported genes.

  • 4. What is the source of most gene models?

    The ab initio and rna annotations are the main provided for the finale gene set with more than 5,000 each.

It is possible to use the priorities (-p) parameter to instruct OMAnnotator to use gene models from one or the other sources in priority. Otherwise, the gene model is selected as a function of the length of the predicted protein sequence it support.

  • 5. Use the extract_consensus to select models with rna support then homology support in priority.

    python OMAnnotator/OMAnnotator.py extract_consensus -a GFF/ -f FASTA/ -x OMA.2.6.0/Output/HierarchicalGroups.orthoxml -st OMA_After/Output/ManualSpeciesTree.nwk -t featuretype.tsv -o Consensus/consensus -p rna,homology,abinitio