Tutorial on How-to-Call SNPs/Indels with a Graphical User Interface (GUI)
Tutorial video on how to detect structural variants using samtools within the DocMind Analyst software
SNP and Indel Detection using a Samtools GUI within the DocMind Analyst Software Suite
You can detect SNPs and indels in the next step of the pipeline with the our easy-to-use graphical user interface (GUI) DocMind Analyst. This module follows a popular and highly recommended workflow. In order to run the analysis, you need a reference sequence and at least one other sequence to compare against. This is visualized in the figure below. The blue sequence is the reference. In this pipeline this would be the core genome. Now you can map other sequences against this reference. In our case, these sequences will be the high quality reads. When a sequence read can be mapped against the reference and has the same sequence, no structural variant will be called. However, if the sequence is different (red bases or “-“ for a deletion), this variant will be recorded.
Input File Requirements
Principally you can use any kind of reference for this SNP/indel calling. But since this is a module within a pipeline, you need to provide the reference in the current working directory and it needs to have the following name: *.backbone.concatenated.fasta (where the * can be replaced by a filename). You don’t need to worry about it when you are running the whole pipeline since the core genome that has been generated in the previous step will automatically processed here.
Apart from a reference sequence, you need to provide paired FASTQ files with sequence reads. If your reads are untrimmed raw reads, you can check the “Quality Trimming” box in the “Pipeline Options” panel. In this case, reads will be trimmed using Trimmomatic. Have a look here at the respective trimming tutorial. Briefly, raw FASTQ-Files must have the ending “.fastq” (uncompressed files) or “fastq.gz” (compressed files). Let’s say your sample name is “ID1”, please name the file with the forward reads ID1_1.fastq (or ID1_1.fastq.gz) and the file with the reverse reads ID1_2.fastq (or ID1_2.fastq.gz). The prefix of both files of one sample (here “ID1”) must be identical, so the DocMind Analyst will recognize that both files belong to the same sample. If your files do not have this structure, you can easily rename them in a bulk (Analysis Tools -> Tools -> Bulk File Renamer).
In case you already have trimmed reads (e.g. from a previous run of the assembly pipeline), you just need to provide the high quality reads. Both high quality FASTQ files of a pair must have the same prefix and a “_HQ” tag to display that these are high quality reads. The R1 read FASTQ must additionally have the “_1”, the R2 read FASTQ the “_2” tag. Following the above example, the files of sample ID1 would be named “ID1_HQ_1.fastq(.gz)” and ID1_HQ_2.fastq(.gz)”. If you have such trimmed reads, you can uncheck the “Quality Trimming” box.
In order to run the SNP calling module, you just have to check the “SNP and INDEL calling” box in the “Pipeline Options” panel and decide about the following options:
At the left hand side of the large panel, you find the basic options, the only thing to worry about as beginner. Here, focus your attention to the “Variants calling and Phylogeny options” panel. In the combo box at the left side you can chose the tool for variant calling. At the moment, you can only chose samtools. In future, this will be extended by other callers (e.g. GATK 4). By the next two checkboxes you can indicate whether you want to call SNPs or indels or both. Be aware that for constructing a core phylogeny, you are will only need to call SNPs and no indels (the default setting).
For advanced users, you can apply a detailed fine tuning when calling and filtering structural variants. For this, you find the “Variant calling Advanced Options” panel on the right side. The first option here is the mapping quality. This score is Phred-scaled and is used as an error estimation, in this case it is the probability of an read to be erroneous mapped to the reference. You find the interpretation of Phred scores in the table below. As you see, a Phred Score of 20 is usually quite accurate, with just a 1% error chance.
The second parameter is the base quality. This score is also Phred-scaled bit here it is the probability that a base is incorrectly called by the sequencer. Like with the mapping quality, a score of 20 means an error chance of 1%. Default for both scores is a value of 30 to be quite specific. Be aware that an increase of particularly the mapping score can have significant consequences when there are large sequence variants between the reference and the sample. In such cases, reads can hardly be mapped even if mapping would be correct due to the sequence differences. A mapping score of more than 40 needs some good explanation. In fact, both scores mean that a variant will not be called if these qualities are not fulfilled. For instance, a SNP would not be called if the chance of a sequencing error at this genomic position would be 2% but you have chosen a base quality score of 20.
Another advanced option is the “Min Depth” parameter. Here you can specify the minimum number of reads that cover a position. For instance, of you specify a minimum depth of three at least three reads must cover the variant position for the variant to be called. The higher the parameter, the more certain is the existence of the variant. Values that are very high might reduce sensitivity since low covered positions would be missed.
The “Max Depth” parameter is somewhat the opposite. With this parameter you can make sure that very highly covered genomic regions are excluded from variant calling since these regions cannot be trusted. You can specify an integer that is a multiplier of the average coverage. One example: The default value of 3 means that all genomic regions with a three times higher coverage than the average coverage will be excluded from variant analysis.
Particularly interesting is the “Population ratio” option. This value is given as fraction of one. The default value of 0.8 means that 80% of the reads need to show the alternative base and not the reference base. This is based in the fact that not all bacteria in a population must have a variant at this position. Some might have the reference base at this position. The choice of this value would depend on what you want. If you simply want to call “true” variants that are present in all bacteria of a population, you need to set the value to “1”. If you want to allow quasispecies in your analysis, you might even decrease the value. Most authors of scientific publication use values between 75% and 90%.
Finally you can modify the “SNP gap” parameter. This parameter will filter SNPs around an indel within the given distance. For instance, a value of 3 will filter SNPs that have been detected within 3 base pairs of an indel. A value of “0” means no filtering.
Module Output Files
For each sample, three output files are of interest. The first two can be found in an automatically created folder named “SNP_Results”. The most important file ends with “_final.vcf”. The prefix of the file is the sample name and the name of the core genome using the structure “sample_name_vs_core_name”. The VCF file contains all structural variants that have been called, including position and quality scores. Read more about the VCF format here. Briefly, the first column is labels as “chrom” and is simply the name of the genomic sequence (e.g. contig name), followed by the genomic “position” at the chromosome. Next column is the “ID” which is usually empty. Next is “REF” and indicates the base at this position in the reference genome. This is followed by the alternative base (“ALT”) that has been detected instead at this position (SNP) or the insertion/deletion. The last important column is called “QUAL” and is the quality score (phred-scaled) at this position.
Another file that you can find in the newly created subfolder ends with “_stats.txt”. In that file you can find various statistics about the detected structural variants. One statistic that is particularly important is the transition/transversion ratio which you can find in the statistic file. A transition is the replacement of a purine by a purine or a pyrimidine by a pyrimidine. A transversion on the other hand occurs when a purine changes to a pyrimidine, or vice versa. Since there a twice as many possible transversions as transitions, this ration would be 0.5 if there is no bias towards transitional or transversional substitutions. For that reason, this ratio is often considered a good measurement for the quality of variant calling. The last output file can be found in the current working directory. It starts with “final” and ends with “.bam”. This BAM file can be used to visualize the mapping of reads against the reference by using a viewer tool like IGV (preinstalled at the instance, find an IGV tutorial here).
Input Files: FASTQ files either uncompressed (“*.fastq”) or compressed (“*.fastq.gz”). Forward reads indicated by “*_HQ_1.fastq(.gz)”, reverse reads by “*_HQ_2.fastq(.gz)”. The “*” indicates the file name. Forward and reverse read files need to have the same file name. Furthermore, you need to provide the FASTA reference file (*.backbone.concatenated.fasta). The files should be stored in your current working directory.
Output Files: In your current working directory you will find BAM-Files which are useful for SNV visualization with IGV. Otherwise, the main output files can be found in the folder “SNP_Results”. The VCF-File (ending “_final.vcf) contains all SNP/Indel data. The files ending with “_stats.txt” will provide you with important statistics of the calling.
Log-File: “samtools.log”. You will find this file in your current working directory. Always check this file for potential error messages.
Storage: Samtools needs some free disk space for temporary files (e.g. SAM-Files). When using compressed FASTQ (.gz ending), plan with 10 times the volume of both FASTQ. E.g. you have 2 x 200 MB FASTQ, allow for 4 GB free disk space. For uncompressed FASTQ plan with approx 4 times the storage. E.g. you have 2 x 900 MB FASTQ, allow for 8 GB disk space. This is very important. If you fail to provide enough storage volumes, your analysis will be canceled at some point. You can change the volume size of your instance by selecting the appropriate volume in your AWS console (under volumes) and increase it. Note, you cannot decrease it.
Recommended instance type: At least m5.4xlarge/m4.4xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request usage of faster instances from the AWS support.
Timeframe: Approx. 10 – 15 minutes per calling (depending on the AWS instance and the sequencing coverage).