For this workshop, we are going to make a consensus annotation for the genome of Drosophila melanogaster. We already made multiple annotation using different process:
abinitio.gff3
)rna.gff3
)homology.gff3
)With OMAnnotator, we wish to combine these annotations to extract annotations with support from multiple sources, in particular closely related species. For this we use the OMA orthology inference framework.
Setup
To start this workshop, we need:
- To install the OMAnnotator software. For this head to https://github.com/DessimozLab/OMAnnotator/
and follow the installation instructions
- To download the input data described above, which are available on Zenodo. For this, you can use :
wget https://zenodo.org/records/15322427/files/InputData.tar.gz
tar -x vf InputData.tar.gz
wget https://zenodo.org/records/15322427/files/OMA_Output.tar.gz
tar -xvf InputData.tar.gz
1.1 Selecting species for evolutonary information
The main benefit of OMAnnotator is the possibility to use precomputed evolutionary information from other species to inform the consensus. The more species are selected, the more information but the slower the computation. Let's select six species to include in the run. We will export them from the OMA Browser webiste: www.omabrowser.org.
1. What are the Family and Order of Drosophila melanogaster
2. Research the taxonomy division and identify six close species to add.
Now, we can download the All-Against-All sequence comparison of those species. This is precomputed data from OMA that will speed up the process.
3. Go to the OMA Browser and export the AllAll for the selected species
The Downloaded archive is a copy of a full OMA Standalone directory. The main folders you need to know are:
(rna,homology,abinitio)))),HERIL));';
1.3 OMAnnotator prepate_data module
Now that we have the archive, we can use OMAnnotator to process our annotation and add them to the folder.
This is all done with OMAnnotator prepare_data
module. But a good practice is to first take a look at the data.
4. Open the source annotation (GFF) files, and notice how the genes, mRNA, and CDS are described.
Depending on the source of the annotation, the descriptor of each element may change.
This may sometimes confuse OMAnnotator which need to understand which element is which.You can provide a feature type file to OMAnnotator.
It needs to be a tab separated file with, in orde:
filename<TAB>gene feature<TAB> transcript feature<TAB>CDS feature
5. Write the feature type file for your input files
abinitio.gff3\tgene\tmRNA\tCDS
homology.gff3\tgene\tmRNA\tCDS
transcript.gff3\tgene\ttranscript\tCDS
6. Take a look at the RNA file. How many genes are there in it? How many transcripts? How do you explain this?
grep -c
10,202 genes.7. Run the prepare_data module of OMAnnotator.
python OMAnnotator/OMAnnotator.py prepare_data -h
to find out which parameters to use.
8. Check the FASTA folder. How many sequences are there in the rna.file?
grep -c '^>' FASTA/rna.fa
9. OMAnnotator also generates so-called Splice files which map gene and alternative transcripts for OMA usage. Does the content of this file fit what you know?
1.2 Prepare the species tree
You need to set the species tree in the OMA Standalone directory to include your new annotation.
For this, you need to add a branch containing your annotation file's prefix at the location in the tree where your species of interest should be.
The added part shoud look like this:
(rna,homology,abinitio)
10. Add the annotation to the SpeciesTree in parameters.drw file, at the node corresponding to the D.melanogaster species.
The second part of the pipeline is a 'classic' run of the OMA Standalone pipeline. OMA is a pipeline for orthology inference. For more information on OMA standalone, please see this blog post and the extensive documentation available here.
Execution can be made in three steps:
The database creation step. To run with:
./bin/oma -c
(Expected execution time: 2min)
The all-against-all computation. In our case, this part would still take hours if not parallelized. Parallelization is straightforward when using a slurm environment. The normal command is:
././bin/oma -s
The Hierarchical Orthologous Groups inference step. This part wrap up the OMA inference and gives the output needed by OMAnnotator.
The command is:
./bin/oma
(Expected execution time: 30min)
For this tutorial, try running the first few commands to experiment the execution. Then, we provide precomputed results downloadable on Zenodo (see above in the Setup section.)
The most important output of OMA for OMAnnotator is the HierarchicalGroups.orthoxml file in the Output folder.
1. Take a look at the OrthoXML file in terminal.
2. Open the phylostrat.html file. How many genes are there at the node from which all the input annotation are connected?
3. What are the main events reported in branch leading to each annotation?
To obtain the OMAnnotator results, there is only need to extract the relevant information from the OMA output.
We can do this with the extract_consensus
module.
1. Use OMAnnotation to extract a consensus with the relevant parameters?
python OMAnnotator/OMAnnotator.py prepare_data -h
0
python OMAnnotator/OMAnnotator.py extract_consensus -a GFF/ -f FASTA/ -x OMA.2.6.0/Output/HierarchicalGroups.orthoxml -st OMA_After/Output/ManualSpeciesTree.nwk -t featuretype.tsv -o Consensus/consensus
2. Look at the generated files, how many genes were extracted?
3. According to the report, what is the most common source in the annotation?
4. What is the source of most gene models?
It is possible to use the priorities (-p) parameter to instruct OMAnnotator to use gene models from one or the other sources in priority. Otherwise, the gene model is selected as a function of the length of the predicted protein sequence it support.
5. Use the extract_consensus
to select models with rna support then homology support in priority.
python OMAnnotator/OMAnnotator.py extract_consensus -a GFF/ -f FASTA/ -x OMA.2.6.0/Output/HierarchicalGroups.orthoxml -st OMA_After/Output/ManualSpeciesTree.nwk -t featuretype.tsv -o Consensus/consensus -p rna,homology,abinitio