Module 4: Estimating a Species Tree

Back to home / Reset

Extracting Data for a Species Tree: from the OMA Browser

To compute a species tree using concatenated alignments, we need the sequences of orthologous genes for each species. If the species of interest are all in OMA, it is easy to get this information through the browser.

For this exercise, we are interested in the phylogeny of 20 yeast species in OMA. You will need to download OMA Groups present in most of the species to have a dataset for species tree estimation. In the OMA browser, go to "Download > Export Marker Genes".

  • 1. Select these 20 species.

    • Schizosaccharomyces japonicus (strain yFS275 / FY16936)
    • Saitoella complicata(strain BRC 22490)
    • Yarrowia lipolytica< (strain CLIB 122)
    • Kluyveromyces lactis (strain ATCC 8585)
    • Saccharomyces cerevisiae (strain ATCC 204508)
    • Saccharomyces cerevisiae (strain VIN13)
    • Zygosaccharomyces rouxii
    • Punctularia strigosozonata (strain HHB 11173)
    • Fomitiporia mediterranea (strain MF3/22)
    • Fomitopsis pinicola (strain FP-58527)
    • Phlebiopsis gigantea
    • Heterobasidion annosum
    • Serendipita indica (strain DSM11827)
    • Coprinopsis cinerea (strain Okayama-7 )
    • Laccaria bicolor
    • Cryptococcus gatti
    • Cryptococcus neoformans var. neoformans serotype
    • Rhodotorula graminis (strain WP1)
    • Puccinia graminis sp. tritici
    • Ustilago hordei
    Hint: to help locate the species, you can search for their name. Then right-click on their name > select species.

  • 2. In which clade was their last common ancestor?

    Dikarya. The most studied Fungi are a member of this clade.

  • 3. You can vary two parameters for this step: number of genes you wish to export (too many increases the required computation) and the proportion of species that must be in a group. Set the number of gene to -1 (no limit) amd chose 0.5, 0.70, 0.9 and 1.0 as proportion of species. How many groups of genes do you have in each case?

    • 0.5: 3041
    • 0.7: 1679
    • 0.9: 740
    • 1.0: 167
  • 4. What would be the best choice?

    The goal is to have enough conserved genes (Groups) to amplify phylogenetic signal but not too much to keep the computation tractable. As a rule of thumb, the best is to have between 100 and 1000 genes. With 167, 1.0 has enough genes. With more time, 0.9 could be envisioned but already is more demanding.

Extracting data for a species tree: From local genomes with OMA Standalone

To compute a species tree, the sequences of orthologous genes of each species is needed. If the species of interest are not all in OMA, you have to run OMA Standalone to compute the marker gene you’ll need. However, you can still use the computations done from the species in OMA to speed up the process.

For this exercise, you are interested in the phylogeny of 2 species hypothetically not in OMA (either you just sequenced it or it was not included in the browser) and 18 yeasts species in OMA. As orthology between the 18 species were already computed, you need to download their all-vs-all data. In the OMA Browser, go to Download > Export All-All

  • Select these 18 species.

    • Schizosaccharomyces japonicus (strain yFS275 / FY16936)
    • Saitoella complicata (strain BRC 22490)
    • Yarrowia lipolytica (strain CLIB 122)
    • Kluyveromyces lactis (strain ATCC 8585)
    • Saccharomyces cerevisiae (strain VIN13)
    • Zygosaccharomyces rouxii
    • Punctularia strigosozonata (strain HHB 11173)
    • Fomitiporia mediterranea (strain MF3/22)
    • Phlebiopsis gigantea
    • Heterobasidion annosum
    • Serendipita indica (strain DSM11827)
    • Coprinopsis cinerea (strain Okayama-7)
    • Laccaria bicolor
    • Cryptococcus gatti
    • Cryptococcus neoformans var. neoformans serotype D
    • Rhodotorula graminis (strain WP1)
    • Puccinia graminis sp. tritici
    • Ustilago hordei
    You can search for their name in the search bar for speed. Then right-click on their name -> select species.

  • 2.Examine the contents of the archive. We have exported the precomputed all-by-all for our 18 fungal genomes in order to save time when running OMA standalone. The genomes are stored in the DB/ folder. Which species has the biggest genome and how many predicted proteins are in the file?

    Genomes are stored in the FASTA format. FASTA headers always start with ">"

    • You can search for this character in text editor to count its occurences.
    • In the command line, you may use grep doc to select the header lines (starting with ">") and count them.
    • If you are more familiar with Python, you could use Bio.SeqIO or similar.
    The biggest genome is the one of Puccinia graminis with 20363 proteins, much more than the other species.

  • 3. In what folder are the precomputed all-by-all alignments stored in?

    You can find the all-by-all alignment in the Cache/AllAll folders

  • 4. Now we want to add our genomes. Add the following fungal genome to your dataset: YEAST.fa and SPOMBE.fa Now run OMA standalone on the 20 fungal genomes.
    Before starting the computation, it is wise to adjust the parameters file, called “parameters.drw”, which can be edited with any text editor. If the goal is to only generate a dataset for species phylogeny inference (and not keeping other unrelated orthology inferences), one can avoid doing computations and generating output files that are not needed by the following:

    • Uncomment (remove the # from) all the lines starting with WriteOutput EXCEPT #WriteOutput_OrthologousGroupsFasta := false. By keeping that one commented, OMA standalone will produce one fasta file for each inferred OG.
    • Deactivate the Hierarchical Orthologous Group inference, which is not needed here, by setting DoHierarchicalGroups := false;
    • Likewise, deactivate the gene function prediction by setting DoGroupFunctionPrediction := false
    • Do not omit a semicolon at the end of each uncommented statement.

    To run the software, type the following command on the command line from within the OMA.2.4.2 folder: >bin/oma

    If not done on a parallel framework, this can take hours. To speed up time, you can get the completed file by downloading this Figshare archive, under Protocol_2/data/OMA.2.3.1

  • 5. OMA has estimated a rough species tree from orthologous groups, using a distance-based method. (If you know the phylogeny of the species, you can give a predefined species tree in the parameters file. The is written in the newick format. How do you read this format?

    Each comma represents a bifurcation and a parenthesis is a bipartition. A tree of this form ((species A, species B), species C); would have the species A and B together on a terminal branch, and C splitting off before.

  • 6. You can visualize the tree by pasting the newick line into Examine the tree. Where do you think it should be rooted? Keep this tree aside, we will discuss it in more depth in the ‘Visualizing and interpreting tree’ module.

    The tree should be rooted between Ascomycetes and Basidiomycetes, the two major clades in Dikarya.

  • 7. Not trusting this lowly distance tree reconstructed by OMA standalone, you set out to perform a proper tree inference. For that you’ll need orthologous groups containing genes from each genome. Go to the OrthologousGroups : How many are there?

    There are 23200 orthogroups.

  • 8. Optional : Use the Python script provided to obtain groups with at least 100% species (Click the hint button for the command line). How many are there now?

    python --min-nr-species <MIN_NR_SPECIES> --input Output/OrthologousGroupFasta/ --output <destination/directory>

    There are only 220 OG with 20 species in it.

SuperMatrix and Tree Estimation

With marker genes obtained, for example, from the OMA Browser or from an OMA standalone computation, one way to obtain the species phylogeny is to align all sequences within each orthologous group, before combining these alignments into a single concatenated alignment: this is called a supermatrix and is then used as input to tree estimation software to obtain the species tree.

Note: this exercise requires multiple pieces of command line software, as well as Python scripts. If you wish to proceed with this exercise you will need to install these on your computer. Links will take you to the required software. If not, please go directly to the "Visualising and Understanding Trees" section.

  • 1. First, we need to align the sequences within the orthologous groups. For this, it is possible to use any software for multiple sequences alignment, such as MAFFT. For example, this command should work from within the directory containing the FASTA files for the marker genes: $ for i in $(ls -1 *.fa); do mafft --maxiterate 1000 --localpair $i > $i.aln; done </code>

    Note : Computation can take time. If you do not wish to wait for the results, you can find example alignment files in this Figshare archive (under Protocol1/Data/Alignment) for example.

  • 2. Open one of the resulting aligment files. What format is it in?

    Hint: you can visualise it in MView -- see the gene tree module for interpreting alignments.

    FASTA format.

  • 3. The next step is to concatenate the alignments into the supermatrix. A Python script ( is available in this git repository to perform this. This can be run as so: $ python &lt;path&gt;/&lt;to&gt;/&lt;alignments&gt;/*aln --format-output [fasta/phylip] > output What size is the new, concatenated alignment?

    The alignment is 118190 positions long.

  • 4. Finally, use phylogenetic software to estimate the species tree. Here, we propose using RAxML with this command: $ raxmlHPC -f a -m PROTGAMMALG -p 12345 -x 12345 -# 100 -s MSA.fa -n tree.nwk This should result in a tree in the newick format.

    Note : Computation can take time. If you do not wish to wait for the results, you can find example alignment files in this Figshare archive (under Protocol1/trees/RAxML/)RAxML_bipartitionsBranchLabels.100.phylo_4 for example)

    We will analyse this tree in the next module - "Visualising and Understanding Trees".

Visualising and Understanding Trees

Once a phylogeny has been obtained, it is important to have the tools to visualise it, as well as the difficulties in interpretation. This is particularly the case because, even if they are displayed in an unambiguous way, uncertainty may exist in the topology.

For this section, we will use the tree you obtained in the last module. If you skipped this part, please use the RAxML tree available in FigShare under Protocol_1 (Protocol_1/trees/RAXML/RAxML_bipartitionsBranchLabels.100.phylo_4).

We will visualise the tree through a browser on the website. Copy the newick line into the text panel.

  • 1. Unless an outgroup was specified in the parameters, undirected trees are often arbitrarily or approximately rooted. What is rooting?

    Rooting a tree is the act of choosing a branch on an undirected tree where should be the common ancestor for the species.

  • 2. An outgroup is sometimes included in the analysis, which can then be used to choose the root by maintaining maximum distance between the outgroup and the rest of the species. Here we do not have any outgroup species, how can we choose the root?

    Since we know dikarya are divided between ascomycota and basidiomycota, we can place the root between these two groups of species.

  • 3. After rooting, we can see the topology of the tree. However, we do not have equal certainty in each of the branches. Click on "Setting > Branch label". What is a bootstrap value?

    Bootstrap value is the proportion of time a partition was found when shuffling columns of the alignment. The closer it is to 100, the clearer it is the topology is not driven by a specific part of the alignment.

  • 4. What branches are the least certain?

    Here the branches leading to PUNST, PHLGI, FOMPI, HETAN are the least certain and the one between SCHZY and SAICN should be somewhat doubted.

  • 5. The branch lengths give information about the distance between species under the topology. You can show the actual value of the length under "Setting > Branch length". What species are the closest?

    Both saccharomyces species, as would be expected since they are of the same genus.

  • 6. As long as the set of leaves is the same, you can compare two trees with Download the tree in Protocol_2/data/OMA.2.3.1/Output/EstimatedSpeciesTree.nwk, from FigShare. What is different?

    The same branches as in the last question. Whilst caution is warranted and more species should be added to increase certainty, the one obtained from the supermatrix is likely correct.