Module 3: Gene Trees

Retrieving Orthologs for a Gene Tree

In order to construct phylogenetic trees, we make comparisons between sets of genes belonging to different taxa. We are working under the assumption that the taxa we are studying are related by evolution. Therefore, we need to ensure that the genes we use to construct our trees are orthologous - that is, they have evolved through speciation events, from a common ancestor.

OMA Groups are groups of sequences that are all orthologous to one another, and can be found in the OMA Browser. See the OMA Group module for more details.

  • 1. Retrieve the following groups from the OMA Browser. Note: open each in a different tab of your browser.

    • 197280 (fingerprint: NRFHAGC)
    • 807295 (fingerprint: FNSPKVY)
    • 583644 (fingerprint: RTWSAYH)
    Hint: click on the question mark icon next to the group ID to get a description of each of the tabs.

  • 2. What phylogeny are these groups from?

    All of these groups are from Fungi, but they do not all cover the same set of species.

  • 3. What is the description / function of each of the groups?

    • 197280: ribosome-associated ubiquitin-dependent protein catabolic process
    • 807295: 1,3-beta-glucan synthase
    • 583644: Trans-aconitate 3-methyltransferase
  • 4. How many other OMA Groups share homology to each of the groups?

    • 197280: 1
    • 807295: 32
    • 583644: 51
  • 5. How many sequences are there in each orthologous group?

    • 197280: 31
    • 807295: 27
    • 583644: 50
  • 6. Are they protein or DNA sequences? Download the sequences.

    They are protein sequences.
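Once the sequences are downloaded, the answers to questions 5 and 6 can be double-checked locally. The following is a minimal sketch, assuming the download is a FASTA file whose content has been read into a string. The example records, the function names and the alphabet heuristic are illustrative assumptions, not part of the OMA Browser.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[header] += line
    return records


def looks_like_protein(sequence):
    """Heuristic: any letter outside the nucleotide alphabet suggests protein."""
    nucleotide_letters = set("ACGTUN-")
    return any(ch not in nucleotide_letters for ch in sequence.upper())


# Hypothetical two-record example standing in for a downloaded group.
example = """>seq1 hypothetical entry
MKVLAAGIVG
LLQ
>seq2 hypothetical entry
MKILSAGVVG
"""

records = read_fasta(example)
print(len(records), "sequences")
for header, seq in records.items():
    kind = "protein" if looks_like_protein(seq) else "nucleotide"
    print(header, len(seq), kind)
```

For a real file, the string could be obtained with open(path).read(), where path is wherever you saved the download.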

Building a Gene Tree

In order to construct phylogenetic trees, we must first align the sequences. This allows us to compare sequences site by site. Many Multiple Sequence Alignment (MSA) tools are available, several of which can be found on the EBI website.
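To make "site by site" concrete, here is a minimal sketch of the simplest distance that can be computed from two aligned rows: the proportion of differing sites (often called the p-distance). The two aligned rows are made up for illustration, and skipping gapped columns is one possible convention among several.

```python
def p_distance(aligned_a, aligned_b):
    """Fraction of ungapped alignment columns where the two rows differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # ignore columns where either row has a gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


# Hypothetical aligned rows (same length, '-' marks a gap).
row_a = "MKV-LAAG"
row_b = "MKILLSAG"
print(p_distance(row_a, row_b))
```

Clustering methods such as UPGMA and NJ start from a matrix of pairwise distances of this kind, usually corrected with a substitution model.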

  • 1. Use an online sequence aligner to align the sequences. Which output format should you choose?

    You may use the FASTA format, which is a classic and widely supported format for alignments.

  • 2. Construct a phylogenetic tree using your aligned sequences. Tools can be found on the Vital-IT website (RAxML BlackBox; this could take from 10 minutes to over an hour, depending on the group and model!) or on the EBI website (Simple Phylogeny [ClustalW2]; try building two trees, one with each of the UPGMA and NJ methods available in the clustering options).

    Alternatively, if the queue at Vital-IT is too long, you can try other alignment and tree inference tools available online (in particular PhyML) or an IQ-TREE web server instead of RAxML.

  • 3. Why are RAxML and other likelihood methods much slower than the clustering methods?

    Likelihood methods evaluate the likelihood of many candidate topologies under a substitution model and keep the best one. Even though most implementations use heuristics to avoid an exhaustive search, this is still far more time-consuming than clustering methods, which build a single phylogeny by joining sequences step by step.

  • 4. Which group do you think will take the longest to compute a tree for, using RAxML? Which do you think will be the quickest?

    The fastest to compute will be 807295, which has the fewest sequences, and shorter ones. The slowest will be 583644, which has the most sequences.
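The reason likelihood methods cannot simply try every tree becomes clear when the candidates are counted. For n sequences there are (2n - 5)!! = 1 × 3 × 5 × ... × (2n - 5) distinct unrooted binary topologies. The sketch below evaluates this standard formula for the three group sizes given above; the function name is our own.

```python
def count_unrooted_topologies(n_leaves):
    """Number of distinct unrooted binary tree topologies: (2n - 5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


# Group sizes taken from the answers to question 5 in the previous section.
group_sizes = {"807295": 27, "197280": 31, "583644": 50}
for group_id, n in group_sizes.items():
    total = count_unrooted_topologies(n)
    print(group_id, n, "sequences:", len(str(total)), "digits in the topology count")
```

A clustering method such as UPGMA or NJ builds exactly one of these topologies, whereas a likelihood search has to score a (heuristically chosen) subset of them, optimising branch lengths each time.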

Evaluating a Tree

Now that we have our trees, we would like to visualise and compare them. Unfortunately, the output format (Newick) isn't particularly conducive to interpreting trees. Thankfully, there are online tree viewers to help us.

  • 1. View the trees that you have built using an online tree visualisation tool.

  • 2. Reroot the trees, and swap branches, to make comparisons easier.

  • 3. Which species, in group 197280, shares a most recent common ancestor with Trametes versicolor?

    Dichomitus squalens (strain LYAD-421)

  • 4. Which species is most closely related to Mucor circinelloides in group 807295?

    Rhizopus oryzae

  • 5. Can you see any difference between trees estimated using different methods? For instance, which species are grouped differently when comparing the trees found by the UPGMA and NJ methods in group 197280?

    The cluster of three sequences containing ROGHW03087 is closer to the cluster containing TREME05093 in the NJ tree, but closer to the one containing WALI902966 in the RAxML tree, while the clusters themselves have the same composition in both trees. Similarly, PUNST05724 clusters differently in each tree. Other differences may be spotted, but make sure both trees are rooted in the same way so that they are comparable.
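Differences between two trees can also be listed programmatically rather than by eye. The following is a minimal sketch, assuming both trees are available as Newick strings, are rooted in the same way, and use unquoted labels without spaces. It collects the set of leaves under every internal node and reports the groupings found in only one tree. The two example trees and their leaf names (A to E) are hypothetical.

```python
def clades(newick):
    """Return one frozenset of leaf names per internal node of a rooted Newick tree."""
    text = "".join(newick.split()).rstrip(";")  # drop whitespace and the final ';'
    found = set()
    pos = 0

    def read_label():
        # Read a node name and/or ':branch_length'; return only the name.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].split(":")[0]

    def parse_node():
        nonlocal pos
        if text[pos] != "(":
            return {read_label()}  # a leaf
        pos += 1  # consume '('
        leaves = set()
        while True:
            leaves |= parse_node()
            if text[pos] == ",":
                pos += 1
            elif text[pos] == ")":
                pos += 1
                break
            else:
                raise ValueError("unexpected character in Newick string")
        read_label()  # skip any support value or branch length on the internal node
        found.add(frozenset(leaves))
        return leaves

    parse_node()
    return found


# Hypothetical trees that differ only in how A, B and C are grouped.
tree_1 = "(((A:0.1,B:0.2):0.05,C:0.3):0.1,(D:0.1,E:0.1):0.2);"
tree_2 = "(((A:0.1,C:0.2):0.05,B:0.3):0.1,(D:0.1,E:0.1):0.2);"

only_in_1 = clades(tree_1) - clades(tree_2)
only_in_2 = clades(tree_2) - clades(tree_1)
print("only in tree 1:", [sorted(c) for c in only_in_1])
print("only in tree 2:", [sorted(c) for c in only_in_2])
```

Because the comparison is between rooted clades, two trees with the same unrooted topology but different roots would be reported as different, which is why consistent rooting matters before comparing.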