To compute a species tree using concatenated alignments, we need the sequences of orthologous genes for each species. If the species of interest are all in OMA, it is easy to get this information through the browser.
For this exercise, we are interested in the phylogeny of 20 yeast species in OMA. You will need to download OMA Groups present in most of the species to have a dataset for species tree estimation. In the OMA browser, go to "Download > Export Marker Genes".
To compute a species tree, the sequences of orthologous genes of each species are needed. If the species of interest are not all in OMA, you have to run OMA Standalone to compute the marker genes you'll need. However, you can still use the computations already done for the species in OMA to speed up the process.
For this exercise, you are interested in the phylogeny of 2 species hypothetically not in OMA (either you just sequenced them or they were not included in the browser) and 18 yeast species in OMA. As orthology between the 18 species was already computed, you need to download their all-vs-all data. In the OMA Browser, go to "Download > Export All-All".
Use grep (doc) to select the header lines (starting with ">") and count them.
Bio.SeqIO or similar.
To run the software, type the following command on the command line from within the OMA.2.4.2 folder:
$ bin/oma
If not done on a parallel framework, this can take hours. To save time, you can get the completed results by downloading this Figshare archive (under Protocol_2/data/OMA.2.3.1).
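Before a long run, it can be worth checking that each input FASTA file contains the number of sequences you expect. The header-counting idea mentioned above (lines starting with ">") can be sketched in plain Python as follows. This is only a minimal illustration: the function name and the in-memory example data are hypothetical, and Bio.SeqIO would do the same job on real files.

```python
import io


def count_fasta_records(handle):
    """Count sequences by counting header lines, i.e. lines starting with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Hypothetical two-sequence example standing in for a real proteome file.
example = io.StringIO(">seq1\nMKV\nLLA\n>seq2\nMTT\n")
n_records = count_fasta_records(example)
print(n_records)
```

For a real file, the same function can be given an open file handle, for example inside a "with open(path) as handle:" block, where the path is whichever FASTA file you want to check.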
((species A, species B), species C);
would place species A and B together as sister taxa, with species C splitting off earlier.
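To make the nesting in this notation concrete, here is a small sketch that turns a simple Newick string into nested Python tuples, so that each pair of parentheses becomes one group. It is an illustration only: the function name is hypothetical, and it assumes a tree without branch lengths, support values or quoted labels. For real trees, a dedicated parser such as Bio.Phylo is the safer choice.

```python
def parse_newick(text):
    """Parse a simple Newick string (labels only) into nested tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_node():
        nonlocal pos
        skip_ws()
        if text[pos] == "(":
            pos += 1  # skip the opening parenthesis
            children = [parse_node()]
            skip_ws()
            while text[pos] == ",":
                pos += 1  # skip the comma between siblings
                children.append(parse_node())
                skip_ws()
            pos += 1  # skip the closing parenthesis
            return tuple(children)
        # Leaf: read the label up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].strip()

    return parse_node()


tree = parse_newick("((species A, species B), species C);")
print(tree)
```

The inner tuple holds species A and B, which share a more recent common ancestor with each other than either does with species C.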
python filter_groups.py --min-nr-species <MIN_NR_SPECIES> --input Output/OrthologousGroupFasta/ --output <destination/directory>
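The idea behind the minimum-species threshold can be sketched as follows: a group is kept only if its sequences come from at least the requested number of distinct species. This is not the actual code of filter_groups.py, and it rests on a loud assumption: that the species name appears in square brackets at the end of each FASTA header. The header format, the pattern and the function names are all hypothetical and should be adapted to your files.

```python
import re

# ASSUMPTION: headers end with the species name in square brackets,
# e.g. ">P1 [Species A]". Adjust this pattern to the real header format.
SPECIES_PATTERN = re.compile(r"\[([^\]]+)\]\s*$")


def species_in_group(fasta_text):
    """Return the set of species names found in the headers of one group."""
    species = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = SPECIES_PATTERN.search(line)
            if match:
                species.add(match.group(1))
    return species


def keep_group(fasta_text, min_nr_species):
    """Keep a group only if it covers at least min_nr_species distinct species."""
    return len(species_in_group(fasta_text)) >= min_nr_species


# Hypothetical group with sequences from three species.
group = ">P1 [Species A]\nMKV\n>P2 [Species B]\nMKI\n>P3 [Species C]\nMRV\n"
print(keep_group(group, 3), keep_group(group, 4))
```

A higher threshold gives fewer groups but less missing data per group; a lower one gives more groups with more gaps in the final alignment.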
With marker genes obtained, for example, from the OMA Browser or from an OMA Standalone computation, one way to obtain the species phylogeny is to align all sequences within each orthologous group and then combine these alignments into a single concatenated alignment. This is called a supermatrix, and it is used as input to tree estimation software to obtain the species tree.
Note: this exercise requires several pieces of command-line software, as well as Python scripts. If you wish to proceed with this exercise, you will need to install these on your computer. Links will take you to the required software. If not, please go directly to the "Visualising and Understanding Trees" section.
$ for i in *.fa; do mafft --maxiterate 1000 --localpair "$i" > "$i.aln"; done
Note: Computation can take time. If you do not wish to wait for the results, you can find example alignment files in this Figshare archive (under Protocol1/Data/Alignment).
A script (concat_alignments.py) is available in this git repository to concatenate the alignments. It can be run as follows:
$ python concat_alignments.py <path>/<to>/<alignments>/*aln --format-output [fasta/phylip] > output
What size is the new, concatenated alignment?
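To see what concatenation does and how the size of the result arises, here is a minimal sketch of building a supermatrix. It is not the code of concat_alignments.py. It assumes each alignment is already available as a dictionary mapping a species name to its aligned sequence, and that a species missing from a group is padded with gap characters; both the data layout and the function name are hypothetical.

```python
def concatenate_alignments(alignments):
    """Join per-group alignments ({species: aligned_seq}) into one supermatrix."""
    all_species = sorted({sp for aln in alignments for sp in aln})
    parts = {sp: [] for sp in all_species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must all have the same length")
        width = lengths.pop()
        for sp in all_species:
            # ASSUMPTION: species absent from a group are filled with gaps.
            parts[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Hypothetical toy alignments for two orthologous groups.
group1 = {"A": "MK-V", "B": "MKIV", "C": "MRIV"}
group2 = {"A": "TT", "C": "TS"}
matrix = concatenate_alignments([group1, group2])
print(matrix)
print(len(matrix["A"]))
```

The number of columns in the supermatrix is the sum of the lengths of the individual alignments, and the number of rows is the number of species seen in any group.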
$ raxmlHPC -f a -m PROTGAMMALG -p 12345 -x 12345 -# 100 -s MSA.fa -n tree.nwk
This should result in a tree in Newick format.
Note: Computation can take time. If you do not wish to wait for the results, you can find example tree files in this Figshare archive (under Protocol1/trees/RAxML/, for example RAxML_bipartitionsBranchLabels.100.phylo_4).
We will analyse this tree in the next module - "Visualising and Understanding Trees".
Once a phylogeny has been obtained, it is important to have the tools to visualise it and to be aware of the difficulties in interpreting it. This is particularly the case because, even if trees are displayed in an unambiguous way, uncertainty may exist in the topology.
For this section, we will use the tree you obtained in the last module. If you skipped this part, please use the RAxML tree available in Figshare under Protocol_1 (Protocol_1/trees/RAXML/RAxML_bipartitionsBranchLabels.100.phylo_4). We will visualise the tree through a browser on the phylo.io website. Copy the Newick line into the text panel.
Protocol_2/data/OMA.2.3.1/Output/EstimatedSpeciesTree.nwk, from Figshare. What is different?