Tutorial: How to Reconstruct a Phylogeny Using IQTree with a Graphical User Interface (GUI)
Tutorial video on how to use IQTree for constructing a phylogenetic tree with the DocMind Analyst
Working with an IQTree GUI within the DocMind Analystics Software Suite
IQTree is another popular program for phylogenetic reconstruction. In our experience, it works well for whole-genome sequence alignments, whereas other tools are specialized for short alignments with a large number of sequences. IQTree is also much faster than standard RAxML and offers an automatic search for the best-fit model. This is handy for beginners but can extend the computation time significantly. For this reason, it is still recommended to start with the standard GTR and HKY models, which you can select from the graphical interface of DocMind Analyst.
Input File Requirements
Carrying out this part of the pipeline as a single analysis is very easy. You just need to create a subfolder named “Alignment” in your current working directory and place your FASTA alignment file, called “Core_Alignment.faa”, in it. While this file can be any alignment file you have generated, it is recommended that you only use alignment files generated by this pipeline. As usual, if you run the phylogenetic reconstruction as part of the pipeline, you don’t need to worry about these input requirements since they are already set up by the previous steps.
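If you prepare the alignment yourself, a quick check before submitting can save a failed run. The following is a minimal sketch in Python, not part of DocMind Analyst: the function names `parse_fasta` and `check_alignment` are made up for this illustration, and the only things taken from the tutorial are the folder and file names. It verifies that the file parses as FASTA and that all sequences have the same length, which any alignment must satisfy.

```python
from pathlib import Path

# Location described in the tutorial: <working directory>/Alignment/Core_Alignment.faa
ALIGNMENT_PATH = Path("Alignment") / "Core_Alignment.faa"


def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary (hypothetical helper)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a sequence name")
            name = fields[0]  # keep only the identifier, drop the description
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def check_alignment(records):
    """Return the alignment length, or raise ValueError if lengths differ."""
    if not records:
        raise ValueError("no sequences found")
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have different lengths: {sorted(lengths)}")
    return lengths.pop()


if __name__ == "__main__":
    if ALIGNMENT_PATH.is_file():
        records = parse_fasta(ALIGNMENT_PATH.read_text())
        print(f"{len(records)} sequences, {check_alignment(records)} columns")
    else:
        print(f"{ALIGNMENT_PATH} not found")
```

Run it from your working directory. It only reads the file and never modifies it.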
To submit the job, choose “IQTree” from the combo box in the “Variant calling and Phylogeny Options” panel and check the “Phylogeny” box in the “Pipeline Options” panel. Then press the “Submit Job” button to start the job. Note that you must first agree to the DocMind Analyst “Terms of Service” by checking the respective box.
As a beginner, you can simply start the pipeline and will get appropriate results. However, some advanced options are available in the “Standard IQTree Advanced Options” panel. Even as a beginner, you should have a look at the “Model” option in particular. The default is “MFP” (= ModelFinder Plus), meaning that you don’t need to choose a substitution model because IQTree will determine the best-fit model. This is very convenient but can significantly extend the analysis time for large datasets. Still, if you are a beginner, this is an easy and good choice.
The best-fit model that was used is documented in the “IQTree.log” file. “MFP” can also be used as “MFP+ASC”, which is the same but corrects for ascertainment bias in case you are using an alignment with variable sites only (for instance, the output of Gubbins). Other choices are popular models such as GTR and HKY, which can be used in any combination with ascertainment bias correction (+ASC), invariable sites (+I), Gamma rate heterogeneity across sites (+G), and FreeRate heterogeneity (+R).
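To make the naming scheme concrete, here is a small sketch of how such model strings are put together from a base model and its “+” suffixes. The helper `build_model_string` is hypothetical and only mirrors the options named in this tutorial. It does not check whether a given combination is sensible for your data.

```python
# Base models and suffixes named in this tutorial (not a complete list of
# what IQTree itself supports).
BASE_MODELS = {"MFP", "GTR", "HKY"}
SUFFIXES = {"I", "G", "R", "ASC"}


def build_model_string(base, suffixes=()):
    """Join a base model and its suffixes with '+', e.g. GTR + [G, I] -> 'GTR+G+I'."""
    if base not in BASE_MODELS:
        raise ValueError(f"unknown base model: {base}")
    seen = set()
    for suffix in suffixes:
        if suffix not in SUFFIXES:
            raise ValueError(f"unknown suffix: {suffix}")
        if suffix in seen:
            raise ValueError(f"suffix given twice: {suffix}")
        seen.add(suffix)
    return "+".join([base, *suffixes])


if __name__ == "__main__":
    print(build_model_string("MFP", ["ASC"]))
    print(build_model_string("GTR", ["G", "I"]))
```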
But what does that mean? Well, phylogeny is based on the idea that mutations occur spontaneously. From the Gubbins tutorial you already know that this is a dangerous assumption because of recombination. But let’s assume a bacterial organism in which genomic changes are indeed only due to spontaneous mutations. Would they occur across all sites of the genome with the same probability? Probably not. Particularly in bacteria, you find “genomic hotspots” where mutations occur much more frequently than at other sites.
To account for such mutation rate heterogeneity, researchers quite often use a gamma model, indicated here by “+G”. Moreover, for bacteria it is also reasonable to assume that some genomic sites never change. Why? Because some sites are so important that the organism would not survive if they were mutated. So we will never see these positions changed in a living organism. If you want to account for such invariable sites, choose the “+I” option. A commonly used and recommended model that largely reflects bacterial genomic evolution is GTR+G+I. For an introduction to DNA substitution models, please read the following informative article.
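The idea behind “+G” and “+I” can be illustrated with a short simulation. This is only a sketch under assumed values: the shape parameter `alpha=0.5` and the share of invariable sites `p_invariable=0.2` are made up for illustration, whereas IQTree estimates such parameters from your alignment. Each site receives a relative rate: invariable sites get rate 0, and all other sites draw their rate from a gamma distribution with mean 1.

```python
import random


def simulate_site_rates(n_sites, alpha=0.5, p_invariable=0.2, seed=1):
    """Draw one relative substitution rate per site under a simple +I+G picture."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_invariable:
            rates.append(0.0)  # invariable site: never changes
        else:
            # gammavariate(shape, scale) has mean shape * scale, here 1.0
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates


if __name__ == "__main__":
    rates = simulate_site_rates(10000)
    variable = [r for r in rates if r > 0]
    print("share of invariable sites:", 1 - len(variable) / len(rates))
    print("share of fast sites (rate > 3):", sum(r > 3 for r in rates) / len(rates))
```

With a small shape parameter, most variable sites evolve slowly while a few evolve very fast, which is the “hotspot” pattern described above.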
The last step is to decide on the number of replicates. DocMind Analyst does not set up IQTree to build only a single tree. For a sound phylogenetic analysis, you need to resample your alignment. In each replicate, the alignment positions are drawn at random, so some positions are left out of that replicate. A tree is then rebuilt, and it is tested whether the same nodes appear in that tree as in the previous tree(s). A node is a splitting point between two or more branches with their tips (the tips are your samples).
In the displayed case, the node represents the most recent common ancestor (MRCA) of a group of samples, here taxa A and B, which are each other’s closest relatives. The MRCA is hypothetical; you have not sampled it. Taxon C is less closely related to A and B, but still shares an ancestor with both taxa. However, such a tree might not be entirely correct. To make a more reliable prediction, you perform the tree estimation described above not just once but many times (100 times or more).
If you do it 100 times and find that the same node appears in 95 out of 100 replicates, you know that the node is well supported (with a bootstrap value of 95%). Lower support suggests that only a few positions in your alignment support the node, since removing these positions leads to a different reconstruction of that node. Generally, a support value of 70 is considered high. However, whether this is truly the case is still a matter of ongoing debate.
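The two ingredients of this procedure, resampling the alignment and counting how often a node reappears, can be sketched in a few lines of Python. This illustrates the standard non-parametric bootstrap only, in which alignment columns are drawn with replacement. It is not how IQTree is implemented, and the tree building itself is left out. The function names and the toy data are assumptions made for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one replicate by drawing alignment columns with replacement.

    Some columns end up left out and others appear several times.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def support_percent(clade, replicate_clades):
    """Percentage of replicate trees that contain the given group of samples.

    Each replicate tree is represented as a set of clades, and each clade
    as a frozenset of sample names.
    """
    clade = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)


if __name__ == "__main__":
    alignment = {"A": "ACGTAC", "B": "ACGTAA", "C": "TCGAAA"}
    print(bootstrap_alignment(alignment, random.Random(0)))

    # Toy result: 95 replicate trees group A with B, 5 group A with C.
    replicates = [{frozenset({"A", "B"})}] * 95 + [{frozenset({"A", "C"})}] * 5
    print(support_percent({"A", "B"}, replicates))
```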
In IQTree, you can select the number of replicates. The more you use, the more accurate your support values are, but the more computational effort is needed, so your analysis will take longer. You can also choose between different bootstrapping methods. The default is UltraFast Bootstrapping with a minimum of 1000 replicates (fewer replicates are not possible). This method is one of the reasons why IQTree is much faster than other tools and very useful for long bacterial genome assemblies.
UltraFast Bootstrapping is much faster than standard non-parametric bootstrapping. The standard method is still available, and you can use as few as 100 replicates for such an analysis. If you have extremely large alignments (> 1000 samples), it is recommended to perform resampling using single-branch tests since they are faster than all bootstrapping methods. Here you can select the SH-like approximate likelihood ratio test, the fast local bootstrap probability method (minimum of 1000 replicates required), or the approximate Bayes test (no replicate number needed; just keep the default value of 1000 in the respective text field).
There is one more property of IQTree that you need to know about. Usually you can specify the number of CPU cores used for the analysis. Here, the pipeline is programmed so that IQTree itself decides on the optimal number. Therefore, it might take all CPUs on your EC2 instance even if you have specified a lower number. This is important to know when you want to run multiple jobs in parallel on the same instance.
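For readers who are curious what these GUI choices roughly correspond to on the command line, the following sketch assembles an IQTree call as a list of arguments without running it. It is an assumption-laden illustration, not the command DocMind Analyst actually issues: the executable name `iqtree`, the output prefix, and the function `build_iqtree_command` are made up here. The flags are the IQ-TREE 1.x spellings (`-s`, `-m`, `-nt`, `-pre`, `-bb`, `-b`, `-alrt`, `-lbp`, `-abayes`); IQ-TREE 2 also offers newer spellings for some of them, such as `-B` and `-T`.

```python
import shlex

# Support methods from this tutorial mapped to IQ-TREE 1.x flags.
SUPPORT_FLAGS = {
    "ufboot": "-bb",    # UltraFast bootstrap
    "standard": "-b",   # standard non-parametric bootstrap
    "alrt": "-alrt",    # SH-like approximate likelihood ratio test
    "lbp": "-lbp",      # fast local bootstrap probability
}
# Minimum replicate numbers as stated in this tutorial.
MIN_REPLICATES = {"ufboot": 1000, "lbp": 1000}


def build_iqtree_command(alignment, model="MFP", support="ufboot", replicates=1000,
                         threads="AUTO", prefix="IQTree_Results/IQTree",
                         executable="iqtree"):
    """Return the argument list for one IQTree run (hypothetical helper)."""
    cmd = [executable, "-s", alignment, "-m", model,
           "-nt", str(threads),  # AUTO lets IQTree pick the number of cores
           "-pre", prefix]       # prefix for all output file names (assumed value)
    if support == "abayes":
        cmd.append("-abayes")    # approximate Bayes test takes no replicate number
    elif support in SUPPORT_FLAGS:
        minimum = MIN_REPLICATES.get(support, 1)
        if replicates < minimum:
            raise ValueError(f"{support} needs at least {minimum} replicates")
        cmd += [SUPPORT_FLAGS[support], str(replicates)]
    else:
        raise ValueError(f"unknown support method: {support}")
    return cmd


if __name__ == "__main__":
    cmd = build_iqtree_command("Alignment/Core_Alignment.faa", model="GTR+G+I")
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces.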
Module Output Files
After completion of the analysis, you will find the output files in a folder named “IQTree_Results” in your current working directory. The most important one is “IQTree.contree”. It contains the consensus tree, that is, the tree that summarizes the trees from all resampling replicates, together with the bootstrap values. It is in Newick format, so you can load it in FigTree and visualize it. The “IQTree.iqtree” file contains a lot of useful information and already presents a simple tree visualization, which is quite useful for a first glance. There are many other output files. It is therefore recommended to have a look at the IQTree manual to find out whether they provide additional information for you.
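If you want a quick overview of the support values without opening a tree viewer, you can pull them out of the Newick text directly. The sketch below assumes that each internal node carries a single numeric support label directly after its closing parenthesis, as in the toy tree used here. Files with combined labels or with quoted sample names containing parentheses would need a proper Newick parser. The function names and the example tree are made up for illustration.

```python
import re

# A support label is the number that directly follows a closing parenthesis.
SUPPORT_PATTERN = re.compile(r"\)(\d+(?:\.\d+)?)")


def extract_support_values(newick):
    """Return all internal-node support values found in a Newick string."""
    return [float(value) for value in SUPPORT_PATTERN.findall(newick)]


def weakly_supported(newick, threshold=70.0):
    """Return the support values that fall below the given threshold."""
    return [v for v in extract_support_values(newick) if v < threshold]


if __name__ == "__main__":
    # Toy tree; for a real run, read the text of IQTree_Results/IQTree.contree.
    tree = "((A:0.1,B:0.2)95:0.05,(C:0.3,D:0.1)62:0.02,E:0.4);"
    print(extract_support_values(tree))
    print(weakly_supported(tree))
```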
Input Files: Any alignment file in FASTA format. The file needs to be located in a folder called “Alignment” in your current working directory and needs to be named “Core_Alignment.faa”.
Output Files: The most important ones are the two tree files “IQTree.contree” and “IQTree.iqtree”. You will find them in the subfolder named “IQTree_Results” in your current working directory.
Log-File: “IQTree.log”. You will find this file in the “IQTree_Results” subfolder in your current working directory. Always check this file for potential error messages.
Storage: IQTree needs only a bit of free disk space. Allow for 1 GB free disk space.
Recommended instance type: At least m5.2xlarge/m4.2xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request usage of faster instances from the AWS support.
Timeframe: Approx. 1 minute for a 10-strain alignment file when choosing a specific model (depending on the AWS instance and the alignment length). It takes longer when you let IQTree select the model with MFP.