Tutorial: How to run a core genome phylogenetic analysis pipeline in the AWS Cloud with a graphical user interface (GUI)

Tutorial video on how to perform a core genome phylogeny analysis using the DocMind Analyst graphical user interface (GUI) 

Core Genome Phylogeny Pipeline GUI using the DocMind Analyst Software Suite

The DocMind Analyst software suite provides an easy-to-use graphical user interface (GUI) that lets you run various analytical pipelines on your own with just a few clicks. The Phylogeny Pipeline GUI performs phylogenetic analyses, which are useful for many purposes, e.g. outbreak analysis. You can reach the core genome phylogeny pipeline by clicking the “Phylogeny” button on the home screen and then choosing the “Core Phylogeny” panel, as illustrated below. To run the entire pipeline, you just need to copy two kinds of input files into a new folder that you create in your home directory. This folder (e.g. called “Core_Phylogeny_Analysis”) is your current working directory. Make sure that your folder names and filenames do not contain any spaces. The entire path to your current working directory should not contain any spaces either. You can separate parts of a filename with an underscore instead (e.g. part1_part2.fasta).
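The no-spaces rule can be checked before submitting a job. The following is a minimal Python sketch, not part of the DocMind Analyst suite; the function name `find_names_with_spaces` and the example path are hypothetical and only illustrate the check described above.

```python
from pathlib import Path


def find_names_with_spaces(working_dir):
    """Return path components and file names that contain a space.

    Hypothetical helper: it only reports problems and renames nothing.
    """
    root = Path(working_dir)
    # Check every folder name along the path to the working directory.
    problems = [part for part in root.parts if " " in part]
    # If the directory exists, also check the names of the entries inside it.
    if root.is_dir():
        problems += [p.name for p in sorted(root.iterdir()) if " " in p.name]
    return problems


# Example with a made-up path: the folder "My Data" would be reported.
print(find_names_with_spaces("/home/user/My Data/Core_Phylogeny_Analysis"))
```

An empty list means that neither the path nor the files in the folder contain spaces.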

Core Genome Phylogeny Pipeline GUI Input File Requirements 

For this pipeline, you must provide assembled genomes in FASTA or GenBank format (*.fasta, *.fas, *.fa, *.gb, *.gbk) in the current working directory, which you choose using the DocMind Analyst entry mask. You must also provide paired high-quality FASTQ reads for your samples. Both FASTQ files of a pair must have the same prefix and carry an “_HQ” tag to indicate that they contain high-quality reads. In addition, the R1 read file must carry the “_1” tag and the R2 read file the “_2” tag. For example, the files of sample ID1 would be named “ID1_HQ_1.fastq(.gz)” and “ID1_HQ_2.fastq(.gz)”.
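The naming convention for high-quality reads can be verified with a short script. This is a sketch under the assumption that the pattern is exactly `<prefix>_HQ_1.fastq(.gz)` and `<prefix>_HQ_2.fastq(.gz)` as described above; the function name `incomplete_hq_pairs` is hypothetical and the script is not part of the DocMind Analyst suite.

```python
import re
from collections import defaultdict

# Matches e.g. "ID1_HQ_1.fastq" or "ID1_HQ_2.fastq.gz" (pattern taken from the text above).
HQ_READ = re.compile(r"^(?P<sample>.+)_HQ_(?P<mate>[12])\.fastq(\.gz)?$")


def incomplete_hq_pairs(filenames):
    """Return the sample prefixes for which the _1 or the _2 file is missing."""
    mates = defaultdict(set)
    for name in filenames:
        match = HQ_READ.match(name)
        if match:
            mates[match.group("sample")].add(match.group("mate"))
    return sorted(s for s, found in mates.items() if found != {"1", "2"})


files = ["ID1.fasta", "ID1_HQ_1.fastq.gz", "ID1_HQ_2.fastq.gz", "ID2_HQ_1.fastq"]
print(incomplete_hq_pairs(files))  # ['ID2']
```

Files that do not match the pattern, such as the assemblies, are simply ignored by this check.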

To run the entire pipeline, you need to check the boxes in the “Pipeline Options” panel. You can run all parts of the pipeline, single parts, or a combination of certain parts. For instance, you may already have a core genome and only want to call SNPs and generate an alignment file. In this case, just check the “SNP and INDEL calling” and “Alignment Generation” boxes. Just make sure that you either select parts that directly follow each other or provide the correct input files for each step. This is explained in the tutorials for the individual pipeline steps and in more detail further down. By default, all boxes (and thus all steps of the pipeline) are checked, with the exception of the “Quality Trimming” box. This is because we assume that you have already produced high-quality reads during the assembly pipeline. You can use these reads and don’t need to trim again. Moreover, these reads will already have the correct filename format.

However, if you don’t have high-quality reads, you can easily produce them in this pipeline by additionally checking the “Quality Trimming” box in the “Pipeline Options” panel. In this case, you can provide untrimmed reads in the following format: the FASTQ files must have the ending “.fastq” (uncompressed files) or “.fastq.gz” (compressed files). Let’s say your sample name is “ID1”: please name the file with the forward reads ID1_1.fastq (or ID1_1.fastq.gz) and the file with the reverse reads ID1_2.fastq (or ID1_2.fastq.gz). The prefix of both files of one sample (here “ID1”) must be identical, so that the DocMind Analyst recognizes that both files belong to the same sample. If your files do not have this structure, you can easily rename them in bulk (Bulk File Renamer). Quality trimming is performed by Trimmomatic. Please check out the tutorial on how to run it here.
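To illustrate what such a bulk rename involves, here is a minimal Python sketch. It is not the Bulk File Renamer itself. It assumes, purely for illustration, that the raw files are named `<sample>_R1.fastq(.gz)` and `<sample>_R2.fastq(.gz)`; your sequencer output may look different, so the pattern would need adjusting. The function `plan_renames` is hypothetical and only returns the planned names without touching any file.

```python
import re

# Assumed (hypothetical) input naming: "<sample>_R1.fastq" / "<sample>_R2.fastq.gz".
RAW_READ = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])(?P<ext>\.fastq(?:\.gz)?)$")


def plan_renames(filenames):
    """Map each matching raw file name to the "<sample>_1/_2" name expected by the pipeline."""
    plan = {}
    for name in filenames:
        match = RAW_READ.match(name)
        if match:
            plan[name] = f"{match['sample']}_{match['mate']}{match['ext']}"
    return plan


for old, new in plan_renames(["ID1_R1.fastq.gz", "ID1_R2.fastq.gz", "notes.txt"]).items():
    print(old, "->", new)
```

Reviewing the plan first and renaming afterwards (for example with `os.rename`) keeps mistakes in the pattern from affecting your data.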


Core Genome Phylogeny Pipeline GUI Computational Requirements

Once all FASTQ files and assembly files are copied into one directory, you can choose it as your current working directory under “System options”. There are two further options that deserve your attention. First, you can choose the number of processors to use for this analysis. The more processors you use, the faster your analysis will run. However, never enter more processors than are available on your instance, since this could severely slow down your analysis. If you are not sure, press the “Auto” button and the number of processors available on your instance will be chosen automatically. For the DocMind Analyst core phylogeny pipeline, an instance with 16 processors is recommended (e.g. m5.4xlarge or c5.4xlarge with at least 32 GB RAM). If you want to use RAxML for phylogenetic reconstruction, instances with more processors and RAM (e.g. m5.12xlarge or c5.24xlarge) can accelerate your analysis significantly and are often needed due to the high memory requirements.
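The rule of never requesting more processors than the instance offers can be expressed in a few lines of Python. This sketch does not show how the “Auto” button works internally; it only illustrates the idea, and the function name `safe_processor_count` is hypothetical.

```python
import os


def safe_processor_count(requested, available=None):
    """Limit the requested number of processors to what the machine offers."""
    if available is None:
        # os.cpu_count() can return None on unusual systems; fall back to 1.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


print(safe_processor_count(32, available=16))  # 16
print(safe_processor_count(8, available=16))   # 8
```

Called without the `available` argument, the function uses the processor count reported by the machine it runs on.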

The other option is the “Stop instance after job” checkbox. It is not checked by default. When you check it, your instance will stop 1 minute after the job has finished. This is convenient when a job takes some time and you want to do something else in the meantime. It can save you money, since your AWS instance would otherwise stay active without doing any work. But be cautious: if you have other jobs running on the same instance, they will be interrupted regardless of their status. So it’s best to use this option when only one pipeline job is running.

To start the pipeline, you need to agree to our “Terms of Services” by checking the respective box in the “License Agreement” panel. After you press the “Submit Job” button, the job is given a reference number. With this number, you can check the job status in the “Monitor” module, which you find on the “Home” screen. Look here for a tutorial.

Core Genome Phylogeny Pipeline GUI Modules

The pipeline begins by generating a core genome from all genome sequence files that you have provided. It uses Spine for core genome generation. You find here a detailed tutorial on that topic explaining all the available options. Next, you can either trim your paired sequence reads (see above for an explanation) or call variants. When the full pipeline is run, variant calling is automatically performed against the generated core genome. However, you can use any reference genome if you run this module on its own. Please have a look at our SNP calling with samtools tutorial, where we explain every step of the calling and the available options. You can call either single nucleotide polymorphisms (SNPs) or short insertions/deletions (indels). However, for a core genome phylogeny we recommend that you only call SNPs, as explained in the respective tutorial.

From the VCF files produced by the SNP calling, a multi-FASTA alignment is generated. In this alignment, all your genomes are lined up position by position. Look here for a tutorial that explains more about alignments. This alignment can already be used to infer genetic relatedness using SplitsTree 4. Look here for a tutorial on how to use SplitsTree on the DocMind Analyst instance.
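To make the idea of an alignment more concrete, the following Python sketch reads a small multi-FASTA text, checks that all sequences have the same length, and lists the variable positions. The function names and the toy sequences are hypothetical, and the comparison is deliberately simple: it is case-sensitive and treats gaps and ambiguous bases like any other character.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict {sequence name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header; fall back to a numbered name.
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def variable_columns(records):
    """Return the 0-based alignment columns in which the sequences differ."""
    seqs = list(records.values())
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences differ in length, so this is not an alignment")
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


toy_alignment = ">ID1\nACGT\n>ID2\nACGA\n>ID3\nACTT\n"
print(variable_columns(read_fasta(toy_alignment)))  # [2, 3]
```

In the toy example, the third and fourth columns differ between the three genomes, so only these positions carry information about their relatedness.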

Finally, the pipeline uses the FASTA alignment file to reconstruct a phylogeny. You can choose between different tools for this task. Look at the following tutorials for more information: the Gubbins tutorial, the IQTree tutorial, the FastTree tutorial, and the RAxML tutorial. It is impossible to give a general recommendation on which tool to use. We generally think that IQTree is the best fit in most situations, since it provides automated substitution model selection and is very fast. However, if you think that genomic recombination plays a major role in your dataset, Gubbins would be your first choice.

All phylogenetic reconstruction tools provide you with a tree file. This file contains the structure of your phylogenetic tree in a text format. However, you won’t be able to see the tree by just opening such a file. To visualize the tree and to produce high-quality figures for publication, you can use programs like FigTree. Have a look here for a tutorial on how to use FigTree.
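Tree files from these tools are commonly written in the Newick format, which describes the tree with nested parentheses. As a quick sanity check that all samples ended up in the tree, the leaf names can be pulled out with a few lines of Python. This is a simplified sketch: the function name `leaf_names` and the toy tree are hypothetical, and the pattern does not handle quoted labels or labels containing special characters.

```python
import re


def leaf_names(newick):
    """Return the leaf labels of a Newick string in order of appearance.

    A leaf label directly follows "(" or ","; labels that follow ")" are
    support values or internal node names and are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


toy_tree = "((ID1:0.1,ID2:0.2)90:0.05,ID3:0.3);"
print(leaf_names(toy_tree))  # ['ID1', 'ID2', 'ID3']
```

For anything beyond such a check, a dedicated viewer like FigTree or a phylogenetics library is the better choice.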

Below you find a table with all modules tested on an m5.2xlarge instance type. This is the minimum recommended instance for all modules. For IQTree, it is clearly sufficient for running an analysis within a competitive time frame. Of note, Gubbins had the longest run time. With computationally more powerful instances, all running times can be further reduced, except for IQTree, which would probably not use more resources and is already very fast. The table gives you an idea of the expected running times and required resources. However, be aware that these are just examples. You cannot extrapolate these numbers to larger datasets, since none of the programs scales linearly. For large datasets, and on different instance types, the picture might look entirely different.

DMA Computation Time Phylogeny pipeline GUI

At this point, you have finished the pipeline in just a few steps and with very little effort. Finally, have a look at the “Core_Phylogeny_parameter.txt” file that you can find in your current working directory. All your parameter choices are documented there. This file is very important, since it lets you look up your parameter choices later. You will appreciate it once you start writing up your manuscript. Be aware that you will overwrite this file when you start a new job from the same current working directory. Well, that’s it. Please contact us if you have questions.
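Because the parameter file is overwritten by the next job started from the same working directory, it can be worth keeping a timestamped copy. The following Python sketch is one way to do this by hand; it is not a feature of the DocMind Analyst suite, and the function name `backup_parameter_file` is hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_parameter_file(working_dir, name="Core_Phylogeny_parameter.txt"):
    """Copy the parameter file to a timestamped name in the same folder.

    Returns the path of the copy, or None if no parameter file exists yet.
    """
    source = Path(working_dir) / name
    if not source.is_file():
        return None
    stamp = time.strftime("%Y%m%d_%H%M%S")
    # e.g. Core_Phylogeny_parameter_20240131_154500.txt
    target = source.with_name(f"{source.stem}_{stamp}{source.suffix}")
    shutil.copy2(source, target)
    return target
```

Calling `backup_parameter_file("Core_Phylogeny_Analysis")` before submitting a new job leaves the original file in place and adds a dated copy next to it.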
