FastOMA is a software package for inferring homologous relationships between protein-coding genes of multiple species, including generating Hierarchical Orthologous Groups. It takes as input several files in FASTA format, each containing all the protein-coding gene sequences in a species' genome - the proteome. It also requires a species tree to group homologous genes at each taxonomic level of the species of interest.
For this Module, you will need to use our GitPod instance.
In this exercise, we will run FastOMA to infer the orthology information for five yeast species. The proteomes of the five species are already provided in the GitPod environment, located at /workspace/SIBBiodiversityBioinformatics2024/Module3_FastOMA/working_dir/in_folder/proteome.
Another input needed by FastOMA is the species tree. For our case, the species tree in Newick format is provided at /workspace/SIBBiodiversityBioinformatics2024/Module3_FastOMA/working_dir/in_folder/species_tree.nwk. It is as follows:
(((Yarrowia_lipolytica:1,Saccharomyces_cerevisiae:1)Saccharomycetales:1,(Neosartorya_fumigata:1,Sclerotinia_sclerotiorum:1)leotiomyceta:1)Saccharomyceta:1,Schizosaccharomyces_pombe:1)Ascomycota;
You can visualize the species tree using the phylo.io website.
In the GitPod, FastOMA is already installed – you should be able to use it after logging into your GitPod workspace.
Optional (If you are not using GitPod)
If you want to install FastOMA on your system, you can follow the installation instructions on the FastOMA GitHub page.
If you want to download the proteomes on your own system, check out the following hint:
The UniProt database includes the proteomes of many species. You can download the reference proteomes of the following species from UniProt by clicking on “Download one protein sequence per gene (FASTA)”:
Right-click on “Download one protein sequence per gene (FASTA)” and copy the link. Then, use wget to download the file and gunzip to decompress it. For example, for Schizosaccharomyces pombe:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000002485/UP000002485_284812.fasta.gz
gunzip -k UP000002485_284812.fasta.gz
1. In which format are the proteome files?
2. How many proteins are there in the Schizosaccharomyces pombe proteome?
grep ">" in_folder/proteome/Schizosaccharomyces_pombe.fa | wc -l
3. How many leaves are in the species tree? For how many species does the species tree provide evolutionary information?
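The questions above can also be checked for all inputs at once. The following is a minimal sketch, not part of FastOMA: the function names count_proteins and count_leaves are made up for illustration, and the leaf count assumes a Newick file with unquoted labels, as in the tree shown earlier.

```shell
# Count protein records (FASTA header lines start with ">") in every .fa file
# of a folder. Prints "<file name><TAB><count>" per file.
count_proteins() {
    local dir="$1" fasta
    for fasta in "$dir"/*.fa; do
        [ -e "$fasta" ] || continue   # skip when the glob matches nothing
        printf '%s\t%s\n' "$(basename "$fasta")" "$(grep -c '^>' "$fasta")"
    done
}

# Count the leaves of a Newick tree: a leaf label directly follows "(" or ",",
# whereas an internal node label follows ")".
# Assumption: labels are unquoted and contain none of ( ) , : ;
count_leaves() {
    tr -d ' \n' < "$1" | grep -oE '[(,][^(),:;]+' | wc -l | tr -d ' '
}
```

From the working directory, these could be called as count_proteins in_folder/proteome and count_leaves in_folder/species_tree.nwk.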
The FastOMA algorithm runs in three main steps:
Note that these steps are efficiently executed, thanks to our highly-parallelized pipeline implemented in Nextflow. The output of FastOMA is reported in OrthoXML, which is the standard format for HOGs. For more information on HOGs, see Module 1 and also Zahn-Zabal et al. F1000, 2020 (page 4).
First, change directory to Module3_FastOMA/working_dir/, where the folder in_folder exists.
cd /workspace/SIBBiodiversityBioinformatics2024/Module3_FastOMA/working_dir/
Then, check whether Nextflow is installed on your system by running nextflow -h. Now we can use the command line to run FastOMA on the five proteomes in in_folder, also using the species tree from in_folder/species_tree.nwk.
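Before launching the run, it can help to confirm that the two inputs agree. In this exercise, the leaf labels of the species tree (for example Schizosaccharomyces_pombe) match the proteome file names without the .fa extension. The sketch below checks this; check_species_names is a hypothetical helper, not a FastOMA command, and it assumes unquoted Newick labels and the in_folder layout described above.

```shell
# Report proteome files whose base name is not a leaf label in the species tree.
# Assumption: proteomes are <in_folder>/proteome/*.fa and the tree is
# <in_folder>/species_tree.nwk with unquoted labels.
check_species_names() {
    local in_folder="$1" tree fasta name missing=0
    tree=$(tr -d ' \n' < "$in_folder/species_tree.nwk")
    for fasta in "$in_folder"/proteome/*.fa; do
        [ -e "$fasta" ] || continue
        name=$(basename "$fasta" .fa)
        # A leaf label follows "(" or "," and ends at ":", "," or ")".
        # Note: the name is used as a regular expression, so "." matches any character.
        if ! printf '%s' "$tree" | grep -qE "[(,]${name}[:,)]"; then
            echo "not in species tree: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_species_names in_folder prints nothing and returns 0 when every proteome has a matching leaf.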
1. What is the command line to run FastOMA using a local OMAmer database?
Hint: use the --omamer_db option. Without it, FastOMA will download a large database, covering the entire OMA database. This is not a problem for most machines, but it could be problematic on GitPod.
nextflow FastOMA/FastOMA.nf --input_folder in_folder --output_folder out_folder --omamer_db in_folder/omamerdb.h5
Note: if the analysis is interrupted, you can rerun the same command with the -resume flag, so that Nextflow reuses the steps that already completed.
Execute the above command to run FastOMA.
2. Where is the output orthoXML file?
in_folder
. What does it contain?
Recall that Orthologous Groups are groups of strict orthologs, with at most one representative per species. Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level.
The output of FastOMA includes 3 folders and 7 additional files:
- hogmap: contains the OMAmer results used by FastOMA (described in the OMAmer Module); each file corresponds to an input proteome.
- OrthologousGroupsFasta: contains FASTA files of marker orthologous groups.
- RootHOGsFasta: contains FASTA files of the sequences in each HOG.
- The main orthology results: FastOMA_HOGs.orthoxml, orthologs.tsv.gz, RootHOGs.tsv, OrthologousGroups.tsv.
- Summary reports: phylostratigraphy.html, report.ipynb, report.html.
- The transformed input species tree: species_tree_checked.nwk.
Orthologous groups
The folder OrthologousGroupsFasta includes FASTA files, in which all proteins inside each file are orthologous to each other. These could be used as marker genes for species tree inference (Module 4).
Note: The answers to the questions below could slightly change in different runs.
1. How many Orthologous Groups are there?
2. How many genes in total are present in all Orthologous Groups?
grep geneRef FastOMA_HOGs.orthoxml | wc -l
Orthologous Groups which have a representative gene in every species could be considered as the “core genome” of the species of interest - genes that are conserved in all of the species.
3. How many Orthologous Groups include one representative gene for each species?
Hint: the groups in OrthologousGroups.tsv with one representative gene per species have five genes. You can count how many times each group appears, and keep the groups that appear exactly five times, using this command: cut -f 1 OrthologousGroups.tsv | sort | uniq -c | awk '$1 == 5' | wc -l
Genes in orthologous groups could also be used for tasks such as resolving a species tree (see Module 4 Estimating a Species Tree).
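To see how group sizes are distributed, rather than only the groups of size five, the counting idea can be extended to a small histogram. This is a sketch under assumptions: group_size_histogram is a made-up name, and the file is assumed to be tab-separated with one gene per line, the group identifier in the first column, and a header on the first line (as the document states for RootHOGs.tsv).

```shell
# Print "<group size><TAB><number of groups>" for a TSV whose first line is a
# header and whose first column is the group identifier.
group_size_histogram() {
    tail -n +2 "$1" | cut -f 1 | sort | uniq -c \
        | awk '{ n[$1]++ } END { for (s in n) print s "\t" n[s] }' | sort -n
}
```

For example, group_size_histogram OrthologousGroups.tsv lists each group size next to the number of groups with that size; the row for size 5 corresponds to the groups with a gene in every species.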
4. Which genes are orthologous to the gene A7EQW0_SCLS1 (strict orthology)?
Hint: first find the group that contains the gene in OrthologousGroups.tsv using $ grep A7EQW0_SCLS1 OrthologousGroups.tsv. Then, use the grep command on the group identifier found in the first column: $ grep "OG_XXXXXXX" OrthologousGroups.tsv
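The two grep steps can be combined into one lookup. The following sketch uses a hypothetical helper named group_members; it assumes a tab-separated file with the group identifier in the first column and the protein identifier in the second, which is an assumption about the layout rather than something stated by FastOMA.

```shell
# Print every row of the group(s) that contain a given gene identifier.
# The file is read twice: pass 1 records the group IDs (column 1) of rows whose
# second column contains the gene, pass 2 prints all rows of those groups.
group_members() {
    local gene="$1" tsv="$2"
    awk -F '\t' -v gene="$gene" '
        NR == FNR { if (index($2, gene)) wanted[$1] = 1; next }
        $1 in wanted
    ' "$tsv" "$tsv"
}
```

A call such as group_members A7EQW0_SCLS1 OrthologousGroups.tsv would print the rows of the matching group. Because it only relies on the two-column layout, the same helper should also work on RootHOGs.tsv.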
Root HOGs
The file RootHOGs.tsv and the RootHOGsFasta folder contain information about the highest level of HOGs included in the OrthoXML file. Contrary to Orthologous Groups, Root HOGs represent families of genes that all descend from a single gene in the common ancestor of the species represented in the dataset. As such genes may have undergone duplication during their evolutionary histories, Root HOGs may contain more than one gene per species, which differentiates them from Orthologous Groups.
RootHOGs may be used to help resolve the evolutionary history of a certain gene family as they should contain all the homologs of the gene family.
5. How many root HOGs are in the HOG output file?
Hint: each line in RootHOGs.tsv denotes a gene. You can count the distinct root HOGs in the first column using this command: cat RootHOGs.tsv | cut -f 1 | sort | uniq -c | wc -l
Note that the first line is the header, so the answer is the output value minus 1.
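The header can also be skipped directly, which avoids the manual subtraction. This is a minimal sketch with made-up function names (count_groups, largest_groups); it assumes a tab-separated file with a single header line and the group identifier in the first column.

```shell
# Number of distinct groups (column 1), ignoring the header line.
count_groups() {
    tail -n +2 "$1" | cut -f 1 | sort -u | wc -l | tr -d ' '
}

# The n largest groups (default 5) as "<size><TAB><group id>", largest first;
# ties are ordered by group identifier.
largest_groups() {
    tail -n +2 "$1" | cut -f 1 | sort | uniq -c \
        | awk '{ print $1 "\t" $2 }' | sort -k1,1nr -k2,2 | head -n "${2:-5}"
}
```

For instance, count_groups RootHOGs.tsv gives the number of root HOGs, and largest_groups RootHOGs.tsv 3 shows the three gene families with the most members.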
6. Consider the gene “60S ribosomal protein L15-A” in Schizosaccharomyces pombe with protein ID: RL15A_SCHPO. How many proteins are in the gene family (for these 5 species of interest)? What are their identifiers?
Hint: first find the root HOG that contains the gene in RootHOGs.tsv using $ grep RL15A_SCHPO RootHOGs.tsv. Then, use the grep command on the HOG identifier found in the first column: $ grep "HOG:XXXXXXX" RootHOGs.tsv
7. With regard to the previous question, which species have more than one protein in the HOG? What does it mean?
Phylostratigraphy
By reconstructing HOGs, FastOMA also models the evolutionary histories of gene families: at which taxonomic level they are gained, lost, and duplicated. The results of this are contained in the phylostratigraphy files.
9. How many genes are duplicated at the level of Saccharomyceta?
Hint: download the phylostratigraphy.html file in the out_folder to your computer and open it in a browser.
10. Which species appear to have the most unique proteins?
Report from FastOMA
FastOMA produces a report in HTML format (report.html) with information about the input proteomes and details of the output. This report can also be explored as a Jupyter Notebook (report.ipynb).
11. Which species has the most proteins in its proteome?
Hint: check the report.html file.
The report also contains basic statistics about HOGs in the dataset.
12. What is the size of the HOG with the most members? What does it mean to have so many members?
FastOMA also computes a "Completeness Score," which indicates how many species below the defined taxonomic level are found in a given HOG. A high Completeness Score indicates that genes have been found in all species in the clade, which typically means a high-confidence HOG.
13. What is the maximum value of the Completeness Score?
Hint: check the report.html file.
14. Are there many HOGs with a high Completeness Score?