Module 3: FastOMA


3.1 Obtaining data and getting setup

FastOMA is a software package for inferring homologous relationships between protein-coding genes of multiple species, including generating Hierarchical Orthologous Groups. It takes as input several files in FASTA format, each containing all the protein-coding gene sequences in a species' genome - the proteome. It also requires a species tree to group homologous genes at each taxonomic level of the species of interest.


For this Module, you will need to use our GitPod instance.


In this exercise, we will run FastOMA to infer orthology relationships among five yeast species. The proteomes of the five species are already provided in the GitPod environment, at /workspace/SIBBiodiversityBioinformatics2024/Module3_FastOMA/working_dir/in_folder/proteome.

Another input needed by FastOMA is the species tree. For our case, the species tree in newick format is provided at: /workspace/SIBBiodiversityBioinformatics2024/Module3_FastOMA/working_dir/in_folder/species_tree.nwk. It is as follows:

(((Yarrowia_lipolytica:1,Saccharomyces_cerevisiae:1)Saccharomycetales:1,(Neosartorya_fumigata:1,Sclerotinia_sclerotiorum:1)leotiomyceta:1)Saccharomyceta:1,Schizosaccharomyces_pombe:1)Ascomycota;

You can visualize the species tree using the phylo.io website.

In the GitPod, FastOMA is already installed – you should be able to use it after logging into your GitPod workspace.


Optional (If you are not using GitPod)

If you want to install FastOMA on your system, you can follow the installation instructions on the FastOMA GitHub page.

If you want to download the proteomes on your own system, check out the following hint:

Instructions for downloading proteomes

The UniProt database includes the proteomes of many species. You can download the reference proteomes of the following species from UniProt by clicking on “Download one protein sequence per gene (FASTA)”:

Right-click on “Download one protein sequence per gene (FASTA)” and copy the link. Then download the file with wget and decompress it with gunzip. For example, for Schizosaccharomyces pombe:

wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000002485/UP000002485_284812.fasta.gz
gunzip -k UP000002485_284812.fasta.gz


  • 1. In which format are the proteome files?

    FASTA format

  • 2. How many proteins are there in the Schizosaccharomyces pombe proteome?

    Each record in a FASTA file always starts with ">". You can use the following command to calculate the number of records in the FASTA file: grep ">" in_folder/proteome/Schizosaccharomyces_pombe.fa | wc -l

    There are 5,122 proteins in this FASTA file.

  • 3. How many leaves are in the species tree? For how many species does the species tree provide evolutionary information?

    The answer to both questions is 5.
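The two counts above can be reproduced for all inputs at once. The following is a minimal sketch, not part of FastOMA: the function names are our own, and it assumes the proteome files end in .fa (as in the example path above) and that no leaf label in the tree contains a comma.

```shell
# Print "<species><TAB><number of FASTA records>" for each .fa file in a folder.
# Assumption: proteome files use the .fa extension.
count_proteins() {
    local folder="$1" f
    for f in "$folder"/*.fa; do
        [ -e "$f" ] || continue
        printf '%s\t%s\n' "$(basename "$f" .fa)" "$(grep -c '^>' "$f")"
    done
}

# Count the leaves of a Newick tree.
# In any rooted tree, an internal node with k children contributes k-1 commas,
# so the number of leaves equals the number of commas plus one.
count_leaves() {
    local commas
    commas=$(tr -cd ',' < "$1" | wc -c)
    echo $((commas + 1))
}
```

For example, from working_dir you could run count_proteins in_folder/proteome and count_leaves in_folder/species_tree.nwk.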

3.2 Running FastOMA

The FastOMA algorithm runs in three main steps:

  1. Mapping input proteins to existing Hierarchical Orthologous Groups (HOGs) in the OMA database using OMAmer (see Module 2)
  2. Inferring gene families based on the OMAmer results
  3. Orthology inference, resulting in new HOGs

Note that these steps are efficiently executed, thanks to our highly-parallelized pipeline implemented in Nextflow. The output of FastOMA is reported in OrthoXML, which is the standard format for HOGs. For more information on HOGs, see Module 1 and also Zahn-Zabal et al. F1000, 2020 (page 4).

First change directory to the Module3_FastOMA/working_dir/ where the folder in_folder exists.

cd /workspace/SIBBiodiversityBioinformatics2024/Module3_FastOMA/working_dir/

Then, check whether Nextflow is installed on your system by running nextflow -h. Now we can use the command line to run FastOMA on the five proteomes in in_folder, using the species tree in in_folder/species_tree.nwk.
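Before launching the pipeline, it can help to confirm that every leaf of the species tree has a proteome file. In this dataset the file Schizosaccharomyces_pombe.fa corresponds to the leaf Schizosaccharomyces_pombe, so the sketch below assumes that convention (leaf name plus .fa inside in_folder/proteome). The function names are hypothetical helpers, not FastOMA commands.

```shell
# List the leaf names of a Newick tree, one per line.
# Leaf labels follow "(" or ","; internal labels follow ")" and are skipped.
tree_leaves() {
    tr -d ' \t\r\n' < "$1" | grep -oE '[(,][^(),:;]+' | tr -d '(,'
}

# Report every leaf that has no non-empty proteome file.
# Assumption: proteomes are stored as <in_folder>/proteome/<leaf>.fa
check_inputs() {
    local in_folder="$1" leaf status=0
    for leaf in $(tree_leaves "$in_folder/species_tree.nwk"); do
        if [ ! -s "$in_folder/proteome/$leaf.fa" ]; then
            echo "missing proteome for leaf: $leaf"
            status=1
        fi
    done
    return "$status"
}
```

Running check_inputs in_folder prints nothing and returns 0 when all leaves are matched.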

  • 1. What is the command line to run FastOMA using a local OMAmer database?

    Check the FastOMA GitHub page. Important note: without specifying --omamer_db FastOMA will download a large database, covering the entire OMA database. This is not a problem for most machines, but it could be problematic on GitPod.

    nextflow run FastOMA/FastOMA.nf --input_folder in_folder --output_folder out_folder --omamer_db in_folder/omamerdb.h5

    Note: if the analysis is interrupted, you can use the -resume flag.

Execute the above command to run FastOMA.

  • 2. Where is the output orthoXML file?

    Check the parent directory of in_folder. What does it contain?

    out_folder. The output of FastOMA includes three folders (hogmap, OrthologousGroupsFasta, RootHOGsFasta) and eight files (FastOMA_HOGs.orthoxml, orthologs.tsv.gz, OrthologousGroups.tsv, phylostratigraphy.html, report.ipynb, RootHOGs.tsv, report.html, species_tree_checked.nwk).

3.3 Interpreting the results

Recall that Orthologous Groups are groups of strict orthologs, with at most one representative per species. Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level.

The output of FastOMA includes 3 folders and 8 additional files:

-hogmap: contains the OMAmer results used by FastOMA (Described in the OMAmer Module); each file corresponds to an input proteome.

-OrthologousGroupsFasta: contains FASTA files of marker orthologous groups

-RootHOGsFasta: contains FASTA files of the sequences in each root HOG

-The main orthology results: FastOMA_HOGs.orthoxml, orthologs.tsv.gz, RootHOGs.tsv, OrthologousGroups.tsv

-Synthetic reports: phylostratigraphy.html, report.ipynb, report.html

-The transformed input species tree: species_tree_checked.nwk
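To confirm that a run finished and wrote everything listed above, a quick check along these lines may be useful. It is only a sketch: check_outputs is a hypothetical helper, and the expected names are copied from the listing above rather than taken from FastOMA itself.

```shell
# Report any expected FastOMA output that is missing from the output folder.
# Assumption: folder and file names are exactly those listed in this section.
check_outputs() {
    local out="$1" item status=0
    for item in hogmap OrthologousGroupsFasta RootHOGsFasta; do
        [ -d "$out/$item" ] || { echo "missing folder: $item"; status=1; }
    done
    for item in FastOMA_HOGs.orthoxml orthologs.tsv.gz RootHOGs.tsv \
                OrthologousGroups.tsv phylostratigraphy.html report.ipynb \
                report.html species_tree_checked.nwk; do
        [ -s "$out/$item" ] || { echo "missing or empty file: $item"; status=1; }
    done
    return "$status"
}
```

Calling check_outputs out_folder returns 0 silently when all three folders and all listed files are present.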

Orthologous groups

The folder OrthologousGroupsFasta contains FASTA files in which all proteins within a file are orthologous to each other. These can be used as marker genes for species tree inference (Module 4).

Note: The answers to the questions below could slightly change in different runs.

  • 1. How many Orthologous Groups are there?

    You can count the number of FASTA files in the folder OrthologousGroupsFasta.

    6,930

  • 2. How many genes in total are present in all Orthologous Groups?

    Genes are coded in the OrthoXML file as <geneRef id="1002001760"/> (for example). We need to count the number of lines which contain "geneRef":

    grep geneRef FastOMA_HOGs.orthoxml | wc -l

    There are 23,860 genes in the groups.

Orthologous Groups which have a representative gene in every species could be considered as the “core genome” of the species of interest - genes that are conserved in all of the species.

  • 3. How many Orthologous Groups include one representative gene for each species?

    Count how many groups in OrthologousGroups.tsv have five genes. You can count how many times each group appears and keep only the groups that appear exactly five times using this command: cut -f 1 OrthologousGroups.tsv | sort | uniq -c | awk '$1 == 5' | wc -l.

    There are 1,759 Orthologous Groups having five genes.
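The same idea extends to a full size distribution of the groups. The sketch below assumes OrthologousGroups.tsv is tab-separated with one header line, the group ID in the first column, and one protein per line; og_size_histogram is a name chosen here for illustration.

```shell
# Print "<group size><TAB><number of groups of that size>", smallest size first.
# Assumptions: first line is a header, column 1 is the group ID,
# and each remaining line lists one protein.
og_size_histogram() {
    tail -n +2 "$1" | cut -f 1 | sort | uniq -c | awk '{ print $1 }' \
        | sort -n | uniq -c | awk '{ printf "%s\t%s\n", $2, $1 }'
}
```

With five species, the row for size 5 should match the count of groups having one gene per species.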

Genes in orthologous groups could also be used for tasks such as resolving a species tree (see Module 4 Estimating a Species Tree).

  • 4. Which genes are orthologous to the gene A7EQW0_SCLS1 (strict orthology)?

    Find the corresponding line in the OrthologousGroups.tsv using $ grep A7EQW0_SCLS1 OrthologousGroups.tsv. Then, use the grep command on the first column $ grep "OG_XXXXXXX" OrthologousGroups.tsv.

    tr|Q4WEI0|Q4WEI0_ASPFU, tr|A7EQW0|A7EQW0_SCLS1, sp|P32468|CDC12_YEAST, tr|Q6C7L3|Q6C7L3_YARLI, sp|P48009|SPN4_SCHPO
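The two grep steps in question 4 can be combined into one pass. This is a sketch under the assumption that OrthologousGroups.tsv is tab-separated with the group ID in column 1 and the protein ID in column 2; og_members is a hypothetical helper name.

```shell
# Print all proteins that share an Orthologous Group with the query gene.
# The file is read twice: first to find the group(s) containing the query,
# then to print every member of those groups.
# Assumption: column 1 = group ID, column 2 = protein ID.
og_members() {
    local gene="$1" tsv="$2"
    awk -F '\t' -v q="$gene" '
        NR == FNR { if (index($2, q)) wanted[$1] = 1; next }
        ($1 in wanted) { print $2 }
    ' "$tsv" "$tsv"
}
```

For example: og_members A7EQW0_SCLS1 OrthologousGroups.tsv. The query is matched as a substring, like grep, so a short query may match several proteins.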

Root HOGs

The file RootHOGs.tsv and the folder RootHOGsFasta contain information about the highest level of HOGs included in the OrthoXML file. Contrary to Orthologous Groups, Root HOGs represent families of genes that all descend from a single gene in the common ancestor of the species represented in the dataset. Because such genes may have undergone duplication during their evolutionary history, a Root HOG may contain more than one gene per species, which differentiates it from an Orthologous Group. Root HOGs can help resolve the evolutionary history of a gene family, as they should contain all the homologs of that family.

  • 5. How many root HOGs are in the HOG output file?

    Each line in the output file RootHOGs.tsv denotes a gene. You can count the distinct root HOGs using this command: cat RootHOGs.tsv | cut -f 1 | sort | uniq -c | wc -l. Note that the first line is the header, so the answer is the output value minus 1.

    There are 6,983 root HOGs (gene families) in this file.

  • 6. Consider the gene “60S ribosomal protein L15-A” in Schizosaccharomyces pombe with protein ID: RL15A_SCHPO. How many proteins are in the gene family (for these 5 species of interest)? What are their identifiers?

    Find the corresponding line in RootHOGs.tsv using $ grep RL15A_SCHPO RootHOGs.tsv. Then, use the grep command on the first column $ grep "HOG:XXXXXXX" RootHOGs.tsv

    There are 7 proteins in this family: RL15A_YEAST, Q4WJV5_ASPFU, sp|P54780|RL15B_YEAST, RL15B_SCHPO, Q6C7Y3_YARLI, A7EYU6_SCLS1, sp|O74895|RL15A_SCHPO.

  • 7. In regards to the previous question, which species have more than one protein in the HOG? What does it mean?

    YEAST (Saccharomyces cerevisiae) and SCHPO (Schizosaccharomyces pombe) have more than one protein in the Root HOG, which indicates that there have been one or more duplications in this gene family.

  • 8. Find the FASTA file for this Root HOG. It could be used as an input for more analysis, such as gene tree analysis.

    The FASTA files are in the folder RootHOGsFasta. They can be aligned using MAFFT, and gene trees can then be inferred from the alignments using FastTree (both are already installed in the environment).
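Questions 5 to 7 can also be answered with two small helpers. This is a sketch, assuming RootHOGs.tsv is tab-separated with a header line, the root HOG ID in column 1 and the protein ID in column 2, and that the species code is the part of the protein ID after the last underscore (as in RL15A_SCHPO). Both function names are our own.

```shell
# Count distinct root HOGs, skipping the header line.
count_root_hogs() {
    tail -n +2 "$1" | cut -f 1 | sort -u | wc -l
}

# Print "<species code><TAB><number of proteins>" for one root HOG.
# Assumption: the species code follows the last "_" of the protein ID.
root_hog_species_counts() {
    local hog="$1" tsv="$2"
    awk -F '\t' -v h="$hog" '
        $1 == h { n = split($2, parts, "_"); count[parts[n]]++ }
        END { for (sp in count) printf "%s\t%d\n", sp, count[sp] }
    ' "$tsv" | sort
}
```

A species code with a count above 1 points to at least one duplication in that family. Pass the full root HOG ID found with grep as the first argument.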

Phylostratigraphy

By reconstructing HOGs, FastOMA also models the evolutionary histories of gene families: at which taxonomic level they are gained, lost, and duplicated. The results of this are contained in the phylostratigraphy files.

  • 9. How many genes are duplicated at the level of Saccharomyceta?

    Download the phylostratigraphy.html file in the out_folder to your computer and open it in a browser.

    359

  • 10. Which species appears to have the most unique proteins?

    Proteins unique to a species are gained on the terminal branch and reported at the leaf level.

    Sclerotinia sclerotiorum, with 8,365 genes gained.

Report from FastOMA

FastOMA produces a report in HTML format (report.html) that summarizes the input proteomes and gives statistics on the inferred HOGs. The report can also be explored as a Jupyter Notebook (report.ipynb).

  • 11. Which species has the most proteins in its proteome?

    Check the section "Stats on input dataset" in the report.html file.

    Sclerotinia sclerotiorum

The report also contains basic statistics about HOGs in the dataset.

  • 12. What is the size of the HOG with the most members? What does it mean to have so many members?

    One HOG has 30 members. This indicates that it likely underwent many duplications. This HOG contains proteins from the hexose transporter family, which is highly duplicated in yeast (https://pubmed.ncbi.nlm.nih.gov/20660490/).

FastOMA also computes a "Completeness Score," which indicates what fraction of the species below the defined taxonomic level are represented in a given HOG. A score close to 1 means that genes were found in nearly all species of the clade, which typically indicates a high-confidence HOG.

  • 13. What is the maximum value of the Completeness Score?

    Check the section "Roothogs (deepest levels for every HOG)" in the report.html file.

    1

  • 14. Are there many HOGs with a high Completeness Score?

    Yes, most HOGs have a completeness score of 1.
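To build intuition for the score, you can compute a rough proxy at the root level: the fraction of the input species that have at least one protein in each root HOG. This sketch is not FastOMA's implementation. It assumes the score is such a fraction, that RootHOGs.tsv has a header line with the root HOG ID in column 1 and the protein ID in column 2, and that the species code follows the last underscore of the protein ID.

```shell
# Print "<root HOG><TAB><fraction of species represented>" for every root HOG.
# Arguments: path to RootHOGs.tsv, total number of species in the dataset.
# Assumption: species code = text after the last "_" in the protein ID.
root_hog_completeness() {
    local tsv="$1" n_species="$2"
    awk -F '\t' -v total="$n_species" '
        NR > 1 {
            n = split($2, parts, "_")
            key = $1 SUBSEP parts[n]
            # Count each species only once per root HOG.
            if (!(key in seen)) { seen[key] = 1; present[$1]++ }
        }
        END { for (hog in present) printf "%s\t%.2f\n", hog, present[hog] / total }
    ' "$tsv" | sort
}
```

For this dataset you would call root_hog_completeness RootHOGs.tsv 5. A value of 1.00 means all five species are represented in that root HOG.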

If FastOMA is installed on your local computer and you have a Jupyter Notebook installation, you can dynamically explore the contents of the HOGs in the FastOMA results using the PyHAM library, which is already installed in the FastOMA environment.