Tutorial: How to Run Pilon Postprocessing with a Graphical User Interface (GUI)

Tutorial video on how to improve your assembly using Pilon as a postprocessing tool within the DocMind Analyst GUI

Postprocessing of assembled Genomes using a Pilon GUI within the DocMind Analyst Software Suite

Assemblies are prone to errors. There is no doubt about that. However, you can improve assembly quality by polishing your assemblies with your high-quality reads. What does that mean? It means that you take your assembly and map your sequence reads against it. If the majority of the reads covering a position disagree with the assembly at that position, this indicates an assembly error. Tools like Pilon use such an analysis to produce an improved assembly.

All you need to do is check the “Pilon Postprocessing” box in the “Pipeline Options” panel within the DocMind Analyst GUI. It is not checked by default because this analysis requires considerable additional computational resources. The Pilon GUI postprocessing module is recommended after a Skesa assembly, but it might also improve SPAdes and A5 assemblies. In any case, postprocessing is worthwhile when your scientific conclusions rely significantly on an assembly, so you should always consider it.

Input File Requirements

You need to provide high-quality reads in your current working directory. This analysis works with paired-end reads, so please make sure you provide both files per sample. Both FASTQ files of a pair must have the same prefix and a “_HQ” tag to indicate that these are high-quality reads. The forward read (R1) FASTQ file must carry the “_1” tag, the reverse read (R2) file the “_2” tag. For example, the files for a sample named “ID1” would be named “ID1_HQ_1.fastq(.gz)” and “ID1_HQ_2.fastq(.gz)”. If you are running the assembly pipeline, you don’t need to worry about that, since these files are produced during the trimming step.
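The naming rule above can be checked before a run with a few lines of Python. This is only a sketch and not part of DocMind Analyst: the function name find_read_pairs and the regular expression are assumptions derived from the naming convention described here.

```python
import re

# Matches names such as "ID1_HQ_1.fastq" or "ID1_HQ_2.fastq.gz".
# "prefix" is the sample name, "read" is the forward (1) or reverse (2) tag.
HQ_PATTERN = re.compile(r"^(?P<prefix>.+)_HQ_(?P<read>[12])\.fastq(\.gz)?$")


def find_read_pairs(filenames):
    """Return (complete, incomplete) lists of sample prefixes.

    A prefix is complete when both the _1 and the _2 file are present.
    Files that do not follow the naming rule are ignored.
    """
    seen = {}
    for name in filenames:
        match = HQ_PATTERN.match(name)
        if match:
            seen.setdefault(match["prefix"], set()).add(match["read"])
    complete = sorted(p for p, reads in seen.items() if reads == {"1", "2"})
    incomplete = sorted(p for p, reads in seen.items() if reads != {"1", "2"})
    return complete, incomplete


# Illustrative file names only.
example = ["ID1_HQ_1.fastq.gz", "ID1_HQ_2.fastq.gz", "ID2_HQ_1.fastq"]
complete, incomplete = find_read_pairs(example)
print("complete pairs:", complete)
print("missing a mate:", incomplete)
```

To check a real folder, pass the file names of your working directory instead of the example list, for instance those returned by os.listdir(".").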

If you are not running this analysis in the pipeline but you want to polish your assemblies after a pipeline run, you also need to provide assemblies that you want to improve. These are FASTA files with the ending “*.fasta”. In your current working directory you need to create a folder named “SPAdes_Final_Assemblies”, “A5_Final_Assemblies” or “Skesa_Final_Assemblies”. You can transfer the assemblies to these folders and start the Pilon GUI module.
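If you prefer to prepare the assembly folder with a script rather than by hand, the step above might look like the following sketch. The folder names are the ones given in the text; the function name stage_assemblies and its arguments are hypothetical and only serve as an illustration.

```python
import shutil
from pathlib import Path

# Folder names as given in the text, keyed by a lowercase assembler name.
ASSEMBLY_FOLDERS = {
    "spades": "SPAdes_Final_Assemblies",
    "a5": "A5_Final_Assemblies",
    "skesa": "Skesa_Final_Assemblies",
}


def stage_assemblies(source_dir, work_dir, assembler):
    """Copy all *.fasta files from source_dir into the assembler-specific
    subfolder of work_dir and return the copied file names."""
    target = Path(work_dir) / ASSEMBLY_FOLDERS[assembler.lower()]
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for fasta in sorted(Path(source_dir).glob("*.fasta")):
        shutil.copy2(fasta, target / fasta.name)
        copied.append(fasta.name)
    return copied
```

Files are copied rather than moved, so the original assemblies stay untouched.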

Module Options

If you are an advanced user, you can find two options on the right-hand side in the “Assembly Polishing Advanced Options” panel. The first option is the mapping quality. This score is Phred-scaled and serves as an error estimate: in this case, it expresses the probability that a sequence read is erroneously mapped to the reference. You find the interpretation of Phred scores in the table below.

Phred Score Table

As you see, a Phred score of 20 is usually quite accurate, with just a 1% error chance. In contrast to SNP calling, where you will also find this parameter, it is recommended not to increase this value above the default of 15 unless you know exactly what you are doing. The reason is that you must assume that your assembly will sometimes contain severe errors. Correct reads will then not be able to map if the mapping quality threshold is set too high. On the other hand, some quality control is needed at this point, since the analysis should improve the assembly and not make it worse. The default of 15 can be considered a good compromise.
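The relationship between a Phred score Q and the error probability P follows the standard definition Q = -10 * log10(P). The short sketch below converts between the two; the function names are chosen for illustration only.

```python
import math


def phred_to_error_probability(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Print the error probability for a few common thresholds.
for q in (10, 15, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
```

With this formula, the default mapping quality of 15 corresponds to an error probability of roughly 3.2%, a score of 20 to 1%, and a score of 30 to 0.1%.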

The second parameter is the base quality. This score is also Phred-scaled, but here it is the probability that a base was called incorrectly by the sequencer. As with the mapping quality, a score of 20 means an error chance of 1%. The default value of 30 (an error chance of 0.1%) is quite strict and makes sure that your assembly is only changed on the basis of reliably sequenced bases. Taken together, both thresholds mean that a base in the assembly will not be replaced unless these quality requirements are met.

When you are happy with all settings, check that you agree to the Terms of Service and press the “Submit Job” button. Once the job has started, you will see new files being generated in the folder that contains your reads. The Pilon GUI module is designed so that all read pairs in your current working directory are processed, so you don’t need to do anything but wait until it is finished. You can check the status of your job in the monitor panel (home -> Monitor) by pressing the “Refresh” button. An “R” stands for a running job, while a “C” means that the job has been completed. In your working folder, you will find a file called “Assembly_parameter.txt”, which documents all your settings for that run.

Module Output Files

The improved assemblies can be found in a folder called “Postprocessed_Assemblies”. They will have the ending “*_postpro.fasta”. Additionally, you will find a VCF file for each improved assembly. The VCF file lists all differences that have been detected between the original assembly and the sequence reads, including their positions and quality scores. Read more about the VCF format here.
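To get a quick overview of what was changed, the fixed columns of a VCF file (CHROM, POS, ID, REF, ALT, QUAL, ...) can be read with plain Python. The following is a minimal sketch that relies only on the general VCF layout (tab-separated columns, header lines starting with “#”); the function name and the example records are made up for illustration and are not actual Pilon output.

```python
def parse_vcf_lines(lines):
    """Parse VCF data lines into dictionaries with the main fixed columns.

    Header lines (starting with '#') and empty lines are skipped.
    A missing quality value ('.') is returned as None.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, _id, ref, alt, qual = fields[:6]
        records.append({
            "contig": chrom,
            "position": int(pos),
            "ref": ref,
            "alt": alt,
            "quality": None if qual == "." else float(qual),
        })
    return records


# Made-up example lines, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig_1\t1042\t.\tA\tG\t87\tPASS\t.",
    "contig_2\t15\t.\tCT\tC\t.\tPASS\t.",
]
for record in parse_vcf_lines(example):
    print(record["contig"], record["position"], record["ref"], "->", record["alt"])
```

For a real file, pass an open file handle instead of the example list, since iterating over a text file yields its lines.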

Input Files: FASTQ files, either uncompressed (“*.fastq”) or compressed (“*.fastq.gz”). Forward reads are indicated by “*_HQ_1.fastq(.gz)”, reverse reads by “*_HQ_2.fastq(.gz)”. The “*” stands for the file name prefix, which must be identical for the forward and reverse read files of a pair. Furthermore, you need to provide the FASTA assembly files (“*.fasta”) that you want to improve. They should be stored in a subfolder of your current working directory (named “SPAdes_Final_Assemblies”, “A5_Final_Assemblies” or “Skesa_Final_Assemblies”).

Output Files: These are FASTA files with the extension “*_postpro.fasta”. You find them in the subfolder named “Postprocessed_Assemblies” in your current working directory. 

Log-File: “Pilon.log”. You will find this file in your current working directory. Always check this file for potential error messages.   

Storage: Pilon needs some free disk space for temporary files. When using compressed FASTQ files (.gz ending), plan for 10 times the combined size of both FASTQ files. E.g., if you have 2 x 200 MB FASTQ, allow for 4 GB of free disk space. For uncompressed FASTQ files, plan for 2 times the combined size. E.g., if you have 2 x 1000 MB FASTQ, allow for 4 GB of free disk space. This is very important. If you fail to provide enough storage, your analysis will be canceled at some point. You can increase the volume size of your instance by selecting the appropriate volume in your AWS console (under Volumes). Note that you cannot decrease it afterwards.
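The rule of thumb above is simple arithmetic and can be written down as a small helper. This is a sketch only: the function names are hypothetical, the factors 10 and 2 are the ones stated in this tutorial, and 1 GB is treated as 1000 MB, as in the examples above.

```python
import shutil


def required_free_space_mb(forward_mb, reverse_mb, compressed):
    """Estimated free disk space (in MB) for one read pair:
    10x the combined FASTQ size if compressed, 2x if uncompressed."""
    factor = 10 if compressed else 2
    return (forward_mb + reverse_mb) * factor


def has_enough_space(path, needed_mb):
    """Compare the estimate with the free space on the disk holding path."""
    free_mb = shutil.disk_usage(path).free / 1_000_000
    return free_mb >= needed_mb


# The two examples from the text: both come out at 4000 MB (4 GB).
print(required_free_space_mb(200, 200, compressed=True))
print(required_free_space_mb(1000, 1000, compressed=False))
```

Calling has_enough_space(".", required_free_space_mb(200, 200, True)) in your working directory tells you whether the estimate is covered before you submit the job.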

Recommended instance type: At least m5.4xlarge/m4.4xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request access to larger instance types from AWS Support.

Timeframe: Approx. 5 to 20 minutes per postprocessing run (depending on the AWS instance and the sequencing coverage).
