Tutorial: How to Reconstruct a Phylogeny Using FastTree with a Graphical User Interface (GUI)
Tutorial video on how to use FastTree for constructing a phylogenetic tree with the DocMind Analyst
Working with a FastTree GUI within the DocMind Analyst Software Suite
FastTree is another program for phylogenetic reconstruction. Its developers report that it is up to about 1,000 times faster than RAxML 7 on large alignments. In principle, it works like the other tools.
Input File Requirements
You need to provide a subfolder named “Alignment” in your current working directory. Place your FASTA alignment file in this subfolder and name it “Core_Alignment.faa”. While this file can be any alignment file you have generated, it is recommended that you only use alignment files generated by this pipeline. As usual, if you run the phylogenetic reconstruction as part of the pipeline, you don’t need to worry about these input requirements, since the previous steps have already set them up.
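If you supply your own alignment, it can help to check the input before submitting the job. The following is a minimal sketch, not part of the DocMind Analyst; the function names `read_fasta_lengths` and `check_alignment` are hypothetical. It only checks that the file exists at the expected location and that all sequences have the same length, which every alignment must satisfy.

```python
from pathlib import Path


def read_fasta_lengths(path):
    """Return {sequence_id: sequence_length} for a FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header line as the sequence ID.
                header = line[1:].split()
                current = header[0] if header else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


def check_alignment(workdir="."):
    """Check that Alignment/Core_Alignment.faa exists and is a valid alignment.

    Returns a tuple (ok, message).
    """
    path = Path(workdir) / "Alignment" / "Core_Alignment.faa"
    if not path.is_file():
        return False, f"missing input file: {path}"
    lengths = read_fasta_lengths(path)
    if not lengths:
        return False, "no FASTA records found"
    if len(set(lengths.values())) != 1:
        return False, "sequences differ in length, so this is not an alignment"
    n_positions = next(iter(lengths.values()))
    return True, f"{len(lengths)} sequences, {n_positions} positions"
```

Calling `check_alignment(".")` from your working directory returns a pair such as `(True, "2 sequences, 6 positions")`, or `False` together with a short reason.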
To submit the job, choose “FastTree” from the combo box in the “Variant calling and Phylogeny Options” panel and check the “Phylogeny” box in the “Pipeline Options” panel. You also need to agree to the DocMind Analyst “Terms of Service” by checking the respective box. Then press the “Submit Job” button to start the job.
Two basic options should always be considered when using FastTree. You will find them in the “FastTree Basic Options” panel. If you have a high number of sequences in your alignment (more than 50,000), it is recommended to use the “Fastest Option”. Just check the respective box if you want to use it. It can speed up your analysis about 4-fold with only a small impact on tree accuracy. If you have just a few hundred sequences, but they are long (e.g. genomes), it might not have a strong effect.
You can also skip the computation of support values by checking the “No support values” option. Support values, as explained in more detail in the IQTree tutorial, rely heavily on the assumption that there is no recombination between your samples. This is highly unlikely for bacterial genomes. Unless you use the output of Gubbins (see Gubbins tutorial here), you might therefore skip support value computation. However, when it comes to a scientific publication, most reviewers are likely to object to a tree without support values, so be careful. By default, support values are computed. In this case, FastTree will estimate the reliability of each split using the Shimodaira-Hasegawa test with 1,000 replicates. One advantage of not computing support values is a dramatic reduction in memory usage. Usually, FastTree needs about 4 GB RAM per 1 million positions in your alignment. This should not be a problem when using an appropriate EC2 instance.
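The memory rule of thumb above is easy to turn into a quick estimate. The sketch below only restates the figure of about 4 GB per 1 million alignment positions from this tutorial; the function names are hypothetical, and real memory usage will depend on your data and options.

```python
# Rule of thumb quoted in this tutorial: about 4 GB RAM per 1 million positions.
GB_PER_MILLION_POSITIONS = 4.0


def estimate_ram_gb(alignment_positions):
    """Rough RAM estimate in GB for an alignment with the given number of positions."""
    return GB_PER_MILLION_POSITIONS * alignment_positions / 1_000_000


def fits_in_ram(alignment_positions, available_gb):
    """Return True if the rough estimate does not exceed the available RAM."""
    return estimate_ram_gb(alignment_positions) <= available_gb
```

For example, `estimate_ram_gb(2_500_000)` gives 10.0, so an alignment with 2.5 million positions would need roughly 10 GB under this rule.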
On the right-hand side, you will find more options in the “FastTree Advanced Options” panel. With “Model”, you can choose the DNA substitution model from a dropdown menu. Two commonly used options are provided. Both options estimate rate heterogeneity using the fast CAT model. What is that? Well, you might assume that all base positions in your alignment mutate at the same rate. While this is easy, it is also a dangerous assumption. Bacterial genomes in particular have evolutionary hotspots. Some positions are exposed to more selection pressure and mutate faster than others, which can even be very stable.
Generally, mutation rate heterogeneity among sites can be modeled in different ways. The most popular approach is to assume that the rate at each site is essentially random and drawn from a statistical distribution, the gamma distribution. If you want this, you can check the “Gamma Option” box. FastTree will use a discrete gamma model with 20 rate categories (“gamma20”), which is quite precise. Each category reflects a different rate, and the more categories are used, the better they approximate the continuous gamma distribution. This is great but computationally expensive. Alternatively, you can leave this box unchecked and use the CAT model (default). It uses a best-fit rate for each site and is quite fast (about 4 to 5 times faster than Gamma on some datasets, with lower memory consumption as well). CAT is also quite precise and is unlikely to perform worse than Gamma.
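For readers who also run FastTree from the command line, the GUI options described above correspond to documented FastTree flags: `-fastest`, `-nosupport`, `-gamma`, `-nt` (nucleotide alignment) and `-gtr` (GTR instead of the default Jukes-Cantor model). The sketch below shows one way these choices could be assembled into a command. How the DocMind Analyst actually calls FastTree is not described in this tutorial, so the mapping, the function name `build_fasttree_command`, and the executable name `FastTree` are assumptions.

```python
def build_fasttree_command(alignment, fastest=False, no_support=False,
                           gamma=False, gtr=True):
    """Assemble a FastTree argument list from GUI-style options.

    Assumption: the executable is named "FastTree" and is on the PATH,
    and the alignment contains nucleotide sequences.
    """
    cmd = ["FastTree", "-nt"]
    if gtr:
        cmd.append("-gtr")        # GTR model; without it FastTree uses Jukes-Cantor
    if fastest:
        cmd.append("-fastest")    # "Fastest Option" for very large alignments
    if no_support:
        cmd.append("-nosupport")  # skip support value computation
    if gamma:
        cmd.append("-gamma")      # discrete gamma with 20 categories; CAT otherwise
    cmd.append(alignment)
    return cmd


if __name__ == "__main__":
    # FastTree writes the tree to standard output, so you would redirect it to a file.
    print(" ".join(build_fasttree_command("Alignment/Core_Alignment.faa")))
```

Returning a list rather than a single string makes it straightforward to pass the result to `subprocess.run` with the tree file opened as `stdout`.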
Module Output Files
After the analysis, the output files can be found in a subfolder named “FastTree_Results”. It is pretty simple: you will find just the log file and the tree file. The latter is called “FT_final.tree”. Again, it is in the common Newick format, and you can load the file in FigTree or a different tree visualization program. We provide a FigTree tutorial here.
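Because Newick is plain text, you can do a quick sanity check on the tree file without a viewer, for example to confirm that all of your strains appear as leaves. The following sketch uses a simple regular expression and a hypothetical function name, `leaf_names`. It assumes unquoted leaf labels without spaces or brackets; for anything more complex, a dedicated tree library is the safer choice.

```python
import re


def leaf_names(newick):
    """Return the leaf labels of a Newick string, in order of appearance.

    A leaf label directly follows "(" or ",". Internal node labels such as
    support values follow ")" and are therefore not matched.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


if __name__ == "__main__":
    with open("FastTree_Results/FT_final.tree") as handle:
        names = leaf_names(handle.read())
    print(len(names), "leaves found")
```

The number of leaves should equal the number of sequences in your alignment.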
Checklist for FastTree
Input Files: Any alignment file in FASTA format. The file needs to be located in a folder called “Alignment” in your current working directory. The file needs to be named “Core_Alignment.faa”.
Output Files: The tree file “FT_final.tree” and a log file can be found in a subfolder named “FastTree_Results” in your current working directory.
Log-File: “fasttree.log”. You will find this file in your current working directory. Always check this file for potential error messages.
Storage: FastTree needs very little disk space. Allow for 1 GB of free disk space.
Recommended instance type: At least m5.4xlarge/m4.4xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request access to faster instances from AWS Support.
Timeframe: Approx. 30 minutes for a 10-strain alignment with the “fastest” option (depending on the AWS instance and the alignment length).