Tutorial on How to Run a Whole Genome Sequencing Assembly Pipeline with a Graphical User Interface (GUI)

Tutorial video on how to run a WGS assembly pipeline with the DocMind Analyst Graphical User Interface (GUI)

A Whole Genome Sequencing Assembly Pipeline GUI using the DocMind Analyst Software Suite

After sequencing on an Illumina sequencer, you are provided with files that contain short sequence reads, usually about 100 – 300 base pairs. When these reads are derived from a single organism, one of your aims might be to assemble them into the whole genome of that organism. This can easily and rapidly be done using the Assembly Pipeline in the DocMind Analyst (Home -> Whole Genome Seq Short Reads -> Assembly), which comes with a full graphical user interface (GUI) for the required bioinformatics tools.
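These read files usually arrive as (often gzipped) FASTQ files, in which every read is stored as four lines: a header, the sequence, a separator and the per-base quality string. A minimal sketch for peeking at such a file, assuming a placeholder name reads_1.fastq.gz:

import gzip

# Count the reads in a gzipped FASTQ file and report the length of the first one.
# "reads_1.fastq.gz" is a placeholder name, not a file produced by the pipeline.
with gzip.open("reads_1.fastq.gz", "rt") as handle:
    lines = handle.read().splitlines()

records = [lines[i:i + 4] for i in range(0, len(lines), 4)]  # header, sequence, '+', qualities
print(f"{len(records)} reads; first read is {len(records[0][1])} bp long")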

The flowchart below displays the components of the assembly pipeline. There are plenty of video and text tutorials which cover each of these steps in great detail. With the DocMind Analyst in the AWS cloud you will be able to produce high-quality genome assemblies for small to large project scales, since you can use many high-end virtual machines in parallel. This tutorial gives you a short overview. We recommend reading or watching the tutorials of each step for a detailed introduction to the topics.


WGS Assembly Pipeline GUI Modules

When you start assembling data, you need to be aware that sequence reads (usually in the FASTQ file format) that come directly from the sequencer may contain erroneous, low-quality sequences or sequencing adapters. It is highly recommended that both are removed, since they can heavily disturb your genome assembly. For this reason, the DocMind Analyst assembly pipeline starts with a quality trimming step using Trimmomatic. It will subsequently report the quality of the trimmed reads using FastQC. Have a look at the Trimming Tutorial for more details.
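Under the hood, this step corresponds roughly to a paired-end Trimmomatic run followed by FastQC. The GUI handles all of this for you; the sketch below is only meant to illustrate what happens, assuming a trimmomatic wrapper on the PATH and illustrative file names and trimming parameters (adapters.fa, sliding-window and minimum-length settings) that are not necessarily the pipeline's defaults.

import os
import subprocess

# Paired-end trimming with Trimmomatic; file names and settings are illustrative assumptions
subprocess.run([
    "trimmomatic", "PE", "-phred33",
    "sample_1.fastq.gz", "sample_2.fastq.gz",      # raw forward/reverse reads
    "sample_1P.fastq.gz", "sample_1U.fastq.gz",    # trimmed forward reads (paired / unpaired)
    "sample_2P.fastq.gz", "sample_2U.fastq.gz",    # trimmed reverse reads (paired / unpaired)
    "ILLUMINACLIP:adapters.fa:2:30:10",            # clip sequencing adapters
    "SLIDINGWINDOW:4:20",                          # cut once the average quality in a 4-base window drops below 20
    "MINLEN:36",                                   # drop reads that became too short
], check=True)

# Quality report of the trimmed reads with FastQC
os.makedirs("fastqc_reports", exist_ok=True)
subprocess.run(["fastqc", "sample_1P.fastq.gz", "sample_2P.fastq.gz", "-o", "fastqc_reports"], check=True)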

The next step is the assembly. DocMind Analyst provides you with three assemblers. SPAdes is probably the most popular one. It produces high-quality assemblies and is sufficiently fast (SPAdes tutorial here). Skesa is a newer assembler which promises ultrafast, high-quality assemblies. We have good experience with Skesa. It is indeed very fast, although it seems to generally produce more contigs with a lower N50 (Skesa tutorial here). For MiSeq data, you might want to check out the A5 assembler (A5 tutorial here).
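For orientation, a plain SPAdes run on one trimmed read pair could look like the sketch below when executed outside the GUI. The file names, thread count and the --careful option are assumptions for illustration, not the pipeline's fixed settings.

import subprocess

# Assemble one trimmed read pair with SPAdes; options are illustrative assumptions
subprocess.run([
    "spades.py",
    "-1", "sample_1P.fastq.gz",   # trimmed forward reads
    "-2", "sample_2P.fastq.gz",   # trimmed reverse reads
    "-o", "spades_out",           # output folder; the assembly ends up in spades_out/contigs.fasta
    "--careful",                  # mismatch/indel correction, slower but often cleaner contigs
    "-t", "8",                    # number of CPU threads
], check=True)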

After assembly, you can consider postprocessing with Pilon. Although this step can be skipped, it is highly recommended since it improves the quality of your assembly. Have a look at the Pilon tutorial here.
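Pilon polishes an assembly by mapping the trimmed reads back onto the contigs and correcting bases, small indels and local misassemblies. Again, the pipeline does this for you; the sketch below only illustrates the principle, assuming bwa, samtools and a pilon wrapper are installed and using placeholder file names.

import subprocess

# Map the trimmed reads back onto the contigs, then polish the assembly with Pilon.
# Tool wrappers and file names are assumptions; the GUI handles these steps internally.
commands = [
    "bwa index spades_out/contigs.fasta",
    "bwa mem -t 8 spades_out/contigs.fasta sample_1P.fastq.gz sample_2P.fastq.gz"
    " | samtools sort -o aligned.bam -",
    "samtools index aligned.bam",
    "pilon --genome spades_out/contigs.fasta --frags aligned.bam --output polished",
]
for cmd in commands:
    subprocess.run(cmd, shell=True, check=True)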

Finally, you can check the quality of your assembly with the integrated statistics module of the DocMind Analyst. It will give you a good idea of whether your assembly was successful. Furthermore, it will provide you with MLST sequence types for an immediate assessment of the genetic relationship of your samples. Watch the tutorial here.
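One of the key numbers such a statistics module reports is the N50 mentioned above. If you ever want to verify it yourself, a minimal N50 calculation over a list of contig lengths looks like this (the lengths below are made up):

def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half of the total assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Hypothetical contig lengths in base pairs
print(n50([1_200_000, 750_000, 300_000, 120_000, 45_000, 8_000]))  # -> 750000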

Input File Requirements

As explained in the video tutorial above, you can select the whole pipeline by checking all boxes in the “Pipeline Options” panel. This is convenient since you can sit back and wait for your results. You just need to create a folder and copy your paired FASTQ files into it. Forward read files should have the ending “_1.fastq(.gz)”, reverse read files the ending “_2.fastq(.gz)”. If your file names are not in such a format, you can rename them in bulk by using our renamer module. Watch a short tutorial for this module here.
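If you prefer to rename the files yourself instead of using the renamer module, a small script along the following lines does the job. The original naming pattern (_R1_001 / _R2_001, as produced by many Illumina runs) is only an assumed example, so adapt the patterns to your actual file names.

import os
import re

# Rename Illumina-style file names to the _1/_2 pattern the pipeline expects.
# The _R1_001/_R2_001 suffixes are an assumed example; adjust them to your naming scheme.
for name in os.listdir("."):
    new_name = re.sub(r"_R1_001\.fastq(\.gz)?$", r"_1.fastq\1", name)
    new_name = re.sub(r"_R2_001\.fastq(\.gz)?$", r"_2.fastq\1", new_name)
    if new_name != name:
        os.rename(name, new_name)
        print(f"{name} -> {new_name}")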

Module Computational Requirements

It is important to consider that the programs within the pipeline might produce large temporary or output files. This is particularly the case if you use SPAdes, A5, Pilon or Trimmomatic. Please check the respective tutorials; they will give you an idea of how much free storage space you need for your pipeline run. A common example would be the use of Trimmomatic and SPAdes (without Pilon postprocessing). Here, it is recommended that you have free disk space of 15 x the volume of compressed input files (or 5 x the volume of uncompressed input files). For instance, if you have 2 x 200 MB FASTQ.GZ input files, allow for at least 6 GB for that sample.
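As a quick back-of-the-envelope check of the example above, using the 15 x rule for compressed input:

# Rough disk-space estimate for one sample, following the 15 x rule for compressed input
compressed_input_gb = 2 * 0.2        # two FASTQ.GZ files of 200 MB each = 0.4 GB
required_free_space_gb = 15 * compressed_input_gb
print(f"Allow for at least {required_free_space_gb:.0f} GB of free disk space")  # -> 6 GB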

You can change the maximum volume size of your instance very easily in case it is needed. Look here for a description. Be aware that AWS only allows you to increase the volume size of an instance; you cannot decrease it. Thus, after you have finished your job, you can transfer the output files to your local hard drive and terminate the EC2 instance. In that case, you will not be charged for the storage size of your instance. Of course, if you need the storage for another job, you can use the same instance.

When you decide to run single parts of the pipeline, you can easily select them by checking the respective boxes in the “Pipeline Options” panel. However, in this case you need to provide the input files in the format that the DocMind Analyst would expect if the full pipeline were run. This is very easy and is explained in great detail in the respective tutorials.

After accepting the Terms of Service, you can start the pipeline by pressing the “Submit Job” button. The pipeline is designed so that all files you have provided in your current working directory will be processed, so you don’t need to do anything but wait until it is finished. You can check the status of your job in the monitor panel (Home -> Monitor) by pressing the “Refresh” button. An “R” stands for a running job, while “C” means that the job has been completed and “Q” that the job is waiting until resources are free. In your working folder, you will find a file called “Assembly_parameter.txt”, which documents all your settings for that run.

Enjoy your analysis. In case you have questions, contact us.   
