Tutorial: How to Reconstruct a Phylogeny Using Gubbins with a Graphical User Interface (GUI)
Tutorial video on how to use Gubbins for constructing a phylogenetic tree with the DocMind Analyst
Working with a Gubbins GUI with the DocMind Analyst Software Suite
The first phylogenetic reconstruction tool available is Gubbins. It is unique in that it specifically targets phylogenetic reconstruction of bacterial populations. Most tools assume that genetic recombination in your dataset can be neglected. However, this can introduce a severe bias, particularly when analyzing bacterial genomes whose short-term evolution is shaped by recombination, e.g. through horizontal gene transfer or transformation. Gubbins is capable of identifying such regions of recombination based on their high density of base substitutions. The phylogenetic reconstruction is then based only on mutations outside these regions and is run by independent software, here FastTree2. Gubbins is sufficiently fast and can analyze alignments with hundreds of genomes in less than a day.
Input File Requirements
Carrying out this part of the pipeline as a single analysis is very easy. You just need to create a subfolder in your current working directory and name it “Alignment”. In this subfolder you need to place your FASTA alignment file, which must be called “Core_Alignment.faa”. While this file can be any alignment file you have generated, it is recommended that you only use alignment files generated by this pipeline. As usual, if you run the phylogenetic reconstruction as part of the pipeline, you don’t need to worry about these input requirements since they are already set up by the previous steps.
In order to submit the job, choose “Gubbins” from the combo box in the “Variant calling and Phylogeny Options” panel and check the “Phylogeny” box in the “Pipeline Options” panel. Before you can submit, you also need to agree to the DocMind Analyst “Terms of Service” by checking the respective box. Then press the “Submit Job” button to start the job.
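If you prepare the input yourself, a quick check before submitting can save a failed run. The following is a minimal sketch, assuming only the folder and file names given above; the helper names are hypothetical and not part of the DocMind Analyst. It verifies that the file is where the pipeline expects it and that all sequences have the same length, as they must in an alignment.

```python
from pathlib import Path


def expected_alignment_path(workdir):
    """Location of the input alignment as described in this tutorial."""
    return Path(workdir) / "Alignment" / "Core_Alignment.faa"


def alignment_problems(fasta_text):
    """Return a list of human-readable problems found in FASTA text."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    if not lengths:
        return ["no sequences found"]
    if len(set(lengths.values())) > 1:
        detail = ", ".join(f"{n}={l}" for n, l in lengths.items())
        return ["sequences differ in length: " + detail]
    return []


if __name__ == "__main__":
    path = expected_alignment_path(".")
    if not path.is_file():
        print(f"missing input file: {path}")
    else:
        for problem in alignment_problems(path.read_text()):
            print(problem)
```

Run from your working directory, the script prints nothing when the file is present and all sequences are equally long.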
As a beginner, you don’t need to think about options. Use the defaults, which will usually give you good results. However, if you are an advanced user and want to perform some fine-tuning, the DocMind Analyst provides you with two options. The first is the “Min SNPs for Rec” option in the “Gubbins Advanced Options” panel. Here you can specify the minimum number of base substitutions required to recognize a recombination. The default is “3”. Increasing the value will decrease the false positive rate but possibly increase the false negative rate. Decreasing it will do the opposite. The other is the “Maximum Iteration” option. Here you can specify the maximum number of iterations, each of which produces a new optimized tree. If two successive iterations converge on the same tree, Gubbins will stop, regardless of how many iterations were allowed. While increasing this value might result in a more accurate tree, it also increases calculation time. Gubbins might need days to converge on large datasets (hundreds of genomes) and could require several gigabytes of RAM. Make sure you select an EC2 instance with at least 64 GB RAM for large datasets and do not set the iteration number too high.
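For readers who also use Gubbins outside the GUI, the two options roughly correspond to command-line settings of the standalone tool. The sketch below assembles such a command. It assumes the standalone `run_gubbins.py` script with its `--min_snps`, `--iterations`, `--tree_builder` and `--prefix` options; how the DocMind Analyst actually invokes Gubbins is not documented here. The iteration default of 5 is also an assumption, taken from standalone Gubbins rather than from the GUI.

```python
def build_gubbins_command(alignment, min_snps=3, iterations=5,
                          prefix="Core_Alignment"):
    """Assemble a standalone Gubbins call as an argument list.

    Assumption: the GUI fields map onto these flags one to one.
    """
    if min_snps < 1:
        raise ValueError("min_snps must be at least 1")
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return [
        "run_gubbins.py",
        "--tree_builder", "fasttree",      # tree builder named in this tutorial
        "--min_snps", str(min_snps),       # "Min SNPs for Rec"
        "--iterations", str(iterations),   # "Maximum Iteration"
        "--prefix", prefix,
        alignment,
    ]


if __name__ == "__main__":
    print(" ".join(build_gubbins_command("Alignment/Core_Alignment.faa")))
```

Building the command as a list rather than a single string makes it safe to pass to `subprocess.run` without shell quoting issues.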
Module Output Files
Output files can be found in the folder “Gubbins_Results”, which will be located in your current working directory. The most important one is “Core_Alignment.final_tree.tre”, the tree file. The final tree is saved here in Newick format. You can open it with any common tree visualization tool. We recommend FigTree, which is preinstalled on the DocMind Analyst machine image. If you want a quick look at your tree, you can open “Core_Alignment.pdf”. Next to the tree, you will see colored blocks for each isolate on the right-hand side. These blocks represent regions identified as recombinations. Blue blocks are unique to a single isolate, while red blocks are shared by multiple isolates. The horizontal position of the blocks represents their position in the alignment. Two other output files are worth mentioning. One is “Core_Alignment.per_branch_statistics.csv”, which contains several statistics. We have copied a table from the Gubbins manual that gives you an overview of the statistics provided.
Probably the most interesting value is the r/m ratio, which gives you an objective indicator of the abundance of recombination in each sample of your dataset. The last output file is “Core_Alignment.filtered_polymorphic_sites.fasta”. This one is unique: it contains only the SNPs from regions outside of recombination. None of the other tools will provide you with such a file. You can use it as an input alignment file for the other phylogenetic reconstruction tools and thereby avoid the recombination bias. However, since only SNPs (variable sites) are included in that alignment (and no invariable sites), you need to make sure that you use a substitution model that accounts for the ascertainment bias. For instance, IQTree is able to do that (tutorial here). Using this alignment also has the advantage of speeding up computation, since the alignment is much shorter.
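As a quick sanity check on the Newick tree file described above, you can list the isolate names it contains without opening a viewer. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and the simple pattern assumes unquoted leaf labels without brackets, commas or colons; a dedicated tree library is the better choice for anything beyond a quick look.

```python
import re


def leaf_names(newick):
    """Return leaf labels from a Newick string, in order of appearance.

    Leaves are the labels that directly follow '(' or ','.
    Internal node labels follow ')' and are therefore skipped.
    """
    matches = re.findall(r"(?<=[(,])\s*([^(),:;]+)", newick)
    return [name.strip() for name in matches]


if __name__ == "__main__":
    # Hypothetical three-isolate tree with branch lengths.
    example = "((isolate_A:0.1,isolate_B:0.2):0.3,isolate_C:0.4);"
    print(leaf_names(example))
```

Comparing this list with the sequence names in your input alignment confirms that every isolate made it into the final tree.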
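To see why the ascertainment correction matters, it helps to confirm that an alignment really consists of variable sites only. The sketch below counts the variable columns in a FASTA alignment and compares them with the alignment length. The function names are hypothetical, and treating gaps and "N" as missing data is an assumption of this example. In IQ-TREE, to our knowledge, the correction is requested by appending "+ASC" to the model name; check the IQ-TREE documentation before relying on this.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def count_variable_columns(sequences):
    """Count alignment columns showing more than one base.

    Gaps ('-') and 'N' are ignored as missing data (an assumption).
    """
    variable = 0
    for column in zip(*sequences.values()):
        bases = {base.upper() for base in column} - {"-", "N"}
        if len(bases) > 1:
            variable += 1
    return variable


if __name__ == "__main__":
    # Hypothetical toy alignment: columns 1 and 4 vary, 2 and 3 do not.
    toy = ">a\nACGT\n>b\nACGA\n>c\nTCGA\n"
    seqs = parse_fasta(toy)
    length = len(next(iter(seqs.values())))
    print(count_variable_columns(seqs), "of", length, "columns are variable")
```

If the number of variable columns equals the alignment length, the file contains no invariable sites and a model with ascertainment bias correction is needed.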
Checklist for Gubbins
Input Files: Any alignment file in FASTA format. The file needs to be located in a folder called “Alignment” in your current working directory. The file needs to be named “Core_Alignment.faa”.
Output Files: You can find all output files in the folder “Gubbins_Results”. The most important one is the “Core_Alignment.final_tree.tre” file. Use it to display your tree, e.g. with FigTree.
Log-File: “gubbins.log”. You will find this file in your current working directory. Always check this file for potential error messages.
Storage: Almost no additional storage is required. Keep about 10 GB free for temporary files.
Recommended instance type: At least m5.4xlarge/m4.4xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request access to faster instances from AWS support.
Timeframe: Approx. 10 to 15 minutes for a 10-strain alignment file (depending on the AWS instance and alignment size).