Tutorial on How to Generate a Multiple Alignment from VCF Files with a Graphical User Interface (GUI)
Tutorial video on how to generate a multiple alignment from samtools output using the DocMind Analyst GUI
Multiple Alignment GUI within the DocMind Analyst Software Suite
Once you have called SNPs and obtained the “_final.vcf” files, you can generate a multiple alignment. This file is the basis for many genetic relatedness analyses. It is a file in which the genomes of a dataset are stacked on top of each other. When you use the core genome (as is the case in this pipeline), each genomic position is compared to the same position in every other genome. The genomes either share the same base at this position or they do not. In the latter case, if one base is substituted for another, it is called a SNP (single nucleotide polymorphism). As an example, consider an alignment in which Seq1 is the reference: compared to that reference, there is a SNP in Seq2 (at position 3, G -> C) and in Seq3 (at position 7, G -> T).
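The comparison described above can be illustrated with a small Python sketch. The three sequences are hypothetical toy data, chosen only so that they reproduce the two SNPs named in the example. The function name `find_snps` is an assumption made for illustration and is not part of the DocMind Analyst.

```python
def find_snps(reference, sample):
    """Return (position, ref_base, sample_base) for every mismatch.

    Positions are 1-based, as in the tutorial text.
    """
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical toy alignment; Seq1 plays the role of the reference.
alignment = {
    "Seq1": "ATGCCAGT",
    "Seq2": "ATCCCAGT",
    "Seq3": "ATGCCATT",
}

reference = alignment["Seq1"]
for name, sequence in alignment.items():
    if name == "Seq1":
        continue
    for pos, ref_base, alt_base in find_snps(reference, sequence):
        print(f"{name}: position {pos}, {ref_base} -> {alt_base}")
```

With this toy data the loop reports one SNP for Seq2 (position 3, G -> C) and one for Seq3 (position 7, G -> T).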
Input File Requirements
To construct this alignment, you need to provide the reference sequence. This can be any sequence that you have used as a reference for SNP calling. Within the core phylogeny pipeline, this is the core genome. The reference must be located in your current working directory and its file name must match the pattern “*.backbone.concatenated.fasta”. This applies regardless of whether it is a core genome that has been created in this pipeline or a sequence that you have used as a reference for SNP calling in the samtools module.
In your current working directory you also need a subfolder named “SNP_Results”. The VCF files of all genomes (*_final.vcf) need to be located in that subfolder. As usual, if you run the whole pipeline, you don’t need to worry, since all files are generated by the previous steps. The DocMind Analyst will now use GATK to produce a FASTA file for each genome in the dataset. For this, it takes the reference genome and the SNP information and generates an “image” of the reference. Wherever a SNP was called, it writes the substituted base at that position. For instance, if the reference has the sequence “ATTCCGG” and the respective genome is known to have a substitution at position 3 (T -> A), it writes the sequence “ATACCGG” for this genome.
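The substitution step can be sketched in a few lines of Python. This is only an illustration of the idea under simplifying assumptions (substitutions only, 1-based positions); it is not how GATK is implemented, and the function name `apply_snps` is hypothetical.

```python
def apply_snps(reference, snps):
    """Write each called SNP into a copy of the reference.

    snps: iterable of (position, ref_base, alt_base) with 1-based positions.
    Each SNP is checked against the original reference base.
    """
    bases = list(reference)
    for pos, ref_base, alt_base in snps:
        if reference[pos - 1] != ref_base:
            raise ValueError(
                f"position {pos}: expected {ref_base}, "
                f"but the reference has {reference[pos - 1]}"
            )
        bases[pos - 1] = alt_base
    return "".join(bases)


# The example from the text: reference ATTCCGG, substitution T -> A at position 3.
print(apply_snps("ATTCCGG", [(3, "T", "A")]))  # prints ATACCGG
```

The check against the reference base is a cheap way to notice when a VCF file and a reference do not belong together.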
Finally, the per-genome FASTA files are piled up into a single MULTI-FASTA file. This is a plain and simple way to get to an alignment. However, please be aware that there are some caveats: (i) If there is more than one base substitution at a position (called multiple variants), one of them is chosen randomly. (ii) Only simple indels will work. If you really want a core SNP phylogeny, you should avoid calling indels altogether. To start the alignment generation, you just need to check the “Alignment Generation” box in the “Pipeline Options” panel and submit the job.
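As a rough sketch of what "piling up" means, the following Python snippet combines per-genome sequences into one MULTI-FASTA text. It assumes substitution-only sequences, so every sequence must have the same length as the reference. The function name `format_multi_fasta`, the line width of 60, and the sample names are assumptions for illustration, not the pipeline's actual code.

```python
def format_multi_fasta(sequences, width=60):
    """Return MULTI-FASTA text for a dict mapping sample name -> sequence.

    Raises ValueError if the sequences differ in length, because then the
    columns would no longer correspond to the same genomic positions.
    """
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")
        # Wrap the sequence into fixed-width lines.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical per-genome sequences derived from the reference "ATTCCGG".
genomes = {
    "reference": "ATTCCGG",
    "isolate_1": "ATACCGG",
    "isolate_2": "ATTCCGA",
}
print(format_multi_fasta(genomes), end="")
```

The length check also shows why indels are problematic in this simple approach: an insertion or deletion shifts all downstream positions of that genome.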
Module Output Files
After the alignment generation has completed, the DocMind Analyst creates a subfolder named “Alignment” in your current working directory. The alignment is called “Core_Alignment.faa”. You will also find it in NEXUS format with the ending “.nex”. It can be used as an input file for subsequent analyses. You may be interested in a maximum likelihood phylogenetic analysis that reconstructs branch lengths with high accuracy. In this case, you can continue with the next tutorial, e.g. performing a phylogenetic reconstruction with Gubbins.
However, sometimes you just want to construct a simple tree that indicates the genetic relatedness of your samples (e.g. for an outbreak analysis). In this case, a minimum spanning tree indicating the SNP distances between isolates is a fine choice. You can load your alignment file into SplitsTree and use its functions. You can start SplitsTree directly from the DM Analyst by pressing the respective button in the Splits Tree panel. SplitsTree also offers some other very rapid analysis techniques, which are well worth exploring in its excellent manual. DocMind Analytics also provides a tutorial that shows you how to construct a minimum spanning tree from the generated alignment using SplitsTree 4.
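To make the notion of SNP distance concrete, here is a minimal Python sketch that counts pairwise differences between aligned sequences. It is not SplitsTree's algorithm. Skipping columns that contain "N" or "-" is an assumption made for this example, and the sequences are hypothetical toy data.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count aligned positions where the two sequences carry different bases.

    Positions where either sequence has a character from `ignore` are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_distances(alignment):
    """Return {(name_a, name_b): distance} for every pair of sequences."""
    return {
        (a, b): snp_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


# Hypothetical toy alignment.
alignment = {
    "Seq1": "ATGCCAGT",
    "Seq2": "ATCCCAGT",
    "Seq3": "ATGCCATT",
}
for (a, b), distance in pairwise_distances(alignment).items():
    print(f"{a} vs {b}: {distance} SNP(s)")
```

A minimum spanning tree connects all isolates using the edges with the smallest total distance, so a table like this is the kind of information such a tree visualizes.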
Input Files: The FASTA reference file with the ending “*.backbone.concatenated.fasta” in the current working directory. Additionally, you need the VCF files for each genome (ending “*_final.vcf”) in a subfolder of your current working directory called “SNP_Results”.
Output Files: One FASTA file called “Core_Alignment.faa” and one NEXUS file called “Core_Alignment.nex”. You will find these files in the subfolder named “Alignment” in your current working directory.
Log-File: “alignment.log”. You will find this file in your current working directory. Always check this file for potential error messages.
Storage: You need approximately twice the number of genomes multiplied by the file size of the core genome. Example: If your alignment consists of 10 genomes and your reference/core genome is 5 MB in size, you need at least 2 x 10 x 5 MB = about 100 MB of disk space. This is very important. If you do not provide enough storage, your analysis will be canceled at some point. You can increase the volume size of your instance by selecting the appropriate volume in your AWS console (under Volumes) and modifying it. Note that you cannot decrease it.
Recommended instance type: At least m5.2xlarge/m4.2xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request the use of faster instances from AWS support.
Timeframe: Approx. 1 minute per genome (depending on the AWS instance and the sequencing coverage).
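The input requirements and the storage rule of thumb above can be combined into a small pre-flight check before submitting the job. The following Python sketch is only an illustration; the function names `estimate_disk_mb` and `check_inputs` are hypothetical and not part of the DocMind Analyst.

```python
from pathlib import Path


def estimate_disk_mb(n_genomes, core_genome_mb):
    """Rule of thumb from the text: 2 x number of genomes x core genome size."""
    return 2 * n_genomes * core_genome_mb


def check_inputs(workdir):
    """Look for the reference and the VCF files in the expected locations.

    Returns (references, vcfs, problems), where problems is a list of messages.
    """
    workdir = Path(workdir)
    snp_dir = workdir / "SNP_Results"
    references = (
        sorted(workdir.glob("*.backbone.concatenated.fasta"))
        if workdir.is_dir() else []
    )
    vcfs = sorted(snp_dir.glob("*_final.vcf")) if snp_dir.is_dir() else []
    problems = []
    if not references:
        problems.append("no *.backbone.concatenated.fasta file in the working directory")
    if not vcfs:
        problems.append("no *_final.vcf files in the SNP_Results subfolder")
    return references, vcfs, problems


# The example from the text: 10 genomes, 5 MB core genome.
print(estimate_disk_mb(10, 5), "MB")  # prints 100 MB

references, vcfs, problems = check_inputs(".")
for message in problems:
    print("Problem:", message)
```

Running it in the working directory lists missing inputs, if any. The disk estimate can then be compared with the free space on the volume.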