Tutorial: How to Run a SPAdes Assembly with a Graphical User Interface (GUI)
Tutorial video on how to assemble your Illumina sequence reads using the SPAdes assembler with a GUI
Whole Genome Sequence (WGS) Assembly using a SPAdes GUI within the DocMind Analyst Software Suite
SPAdes is one of the most frequently used genome assemblers. It offers many options and is known as one of the best assemblers available. To choose SPAdes, select it in the combo box in the “Assembler Choice” panel and check the Assembler box in the “Pipeline Options” panel. Note that you can only use one assembler per analysis run. As with the quality control step, you need to provide trimmed, high-quality FASTQ files in the working directory. This SPAdes GUI module works with paired-end reads, so please make sure you provide both files for each sample. If you want to assemble unpaired (single-end) reads, contact our support.
Input File Requirements
Input files are trimmed, high-quality FASTQ files. Both FASTQ files of a pair must have the same prefix (file name) and a “_HQ” tag to indicate that they contain high-quality reads. The forward read file (R1) must carry the “_1” tag, the reverse read file (R2) the “_2” tag. For example, the files for a sample named “ID1” would be named “ID1_HQ_1.fastq(.gz)” and “ID1_HQ_2.fastq(.gz)”. If you are running the complete assembly pipeline, you don’t need to worry about this, since these files are produced during the trimming step.
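The naming convention above can be checked before a run is submitted. The following is a minimal Python sketch, not part of the software suite: the function name find_read_pairs and the regular expression are illustrative assumptions derived only from the “_HQ_1” / “_HQ_2” rule described above.

```python
import re
from pathlib import Path

# Matches e.g. "ID1_HQ_1.fastq" or "ID1_HQ_2.fastq.gz".
# "prefix" is the sample name, "mate" is 1 (forward) or 2 (reverse).
READ_PATTERN = re.compile(r"^(?P<prefix>.+)_HQ_(?P<mate>[12])\.fastq(?:\.gz)?$")


def find_read_pairs(names):
    """Group file names by sample prefix.

    Returns (pairs, incomplete): pairs maps each prefix to its
    (forward, reverse) file names; incomplete lists prefixes for
    which only one of the two files was found.
    """
    mates = {}
    for name in names:
        match = READ_PATTERN.match(name)
        if match:
            mates.setdefault(match["prefix"], {})[match["mate"]] = name
    pairs = {p: (m["1"], m["2"]) for p, m in mates.items() if len(m) == 2}
    incomplete = sorted(p for p, m in mates.items() if len(m) < 2)
    return pairs, incomplete


def find_read_pairs_in_dir(directory="."):
    """Apply find_read_pairs to the files in a working directory."""
    return find_read_pairs(p.name for p in Path(directory).iterdir() if p.is_file())
```

Calling find_read_pairs_in_dir() on the working directory lists the samples that would be assembled and flags any sample where the forward or reverse file is missing.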
As a beginner, you only need to pay attention to one SPAdes option in the “Basic Assembly Options” panel: whether or not to use the “careful” option (enabled by default). This option tells SPAdes to run the Mismatch Corrector, which can reduce the number of mismatches and short indels in the final contigs/scaffolds. It is highly recommended for small genomes such as bacterial genomes. However, it is not recommended for larger genomes such as eukaryotic genomes, so the right choice depends on the organism you have sequenced.
SPAdes has two more options in the “Advanced Assembly Options” panel. We recommend using this pipeline with Illumina sequence reads. However, it is also possible to use paired-end reads produced by Ion Torrent. In that case, you need to select the correct sequencing technology under “Sequencing Method”. If you have single-cell data, you also need to indicate this by checking the “Single Cell Data” box.
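For readers who want to relate the GUI settings to the SPAdes command line, the following Python sketch assembles a comparable spades.py call. How the GUI module actually invokes SPAdes is not documented here, so this mapping is an assumption; the helper name build_spades_command is hypothetical. The flags -1, -2, -o, --careful, --iontorrent and --sc are SPAdes command-line options, but please check the documentation of your SPAdes version before relying on them.

```python
def build_spades_command(forward, reverse, out_dir, careful=True,
                         ion_torrent=False, single_cell=False):
    """Return an argument list for a paired-end spades.py run.

    Assumed mapping of GUI settings to command-line flags:
      "careful" option               -> --careful
      Sequencing Method: Ion Torrent -> --iontorrent
      "Single Cell Data" box         -> --sc
    """
    cmd = ["spades.py", "-1", forward, "-2", reverse, "-o", out_dir]
    if careful:
        cmd.append("--careful")
    if ion_torrent:
        cmd.append("--iontorrent")
    if single_cell:
        cmd.append("--sc")
    return cmd


if __name__ == "__main__":
    # Show the command for a bacterial Illumina sample named ID1.
    print(" ".join(build_spades_command("ID1_HQ_1.fastq.gz",
                                        "ID1_HQ_2.fastq.gz", "ID1")))
```

The function only builds the argument list; it could be passed to subprocess.run on a machine where spades.py is installed.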
When you are happy with all settings, confirm that you agree to the Terms of Service and press the “Submit Job” button. Once the job has started, you will see new files being generated in the folder that contains your reads. The SPAdes GUI module is designed so that all read pairs in your current working directory are processed, so you don’t need to do anything but wait until it has finished. You can check the status of your job in the monitor panel (home -> Monitor) by pressing the “Refresh” button. An “R” stands for a running job, while a “C” means that the job has been completed. In your working folder, you will find a file called “Assembly_parameter.txt”, which documents all of your settings for that run.
Module Output Files
SPAdes will generate folders named after your samples. These folders contain the output files that SPAdes produces during the assembly. Please check the SPAdes documentation if you want to know more about these files. If you are only interested in your assembled genomes, you can find them in a folder named “SPAdes_Final_Assemblies”. The assemblies will have the same prefix you used in your FASTQ file names (e.g. ID1) and the ending “.fasta”. Congratulations! At this point you have finished the assembly of your sequence reads.
Checklist for SPAdes
Input Files: FASTQ files either uncompressed (“*.fastq”) or compressed (“*.fastq.gz”). Forward reads indicated by “*_HQ_1.fastq(.gz)”, reverse reads by “*_HQ_2.fastq(.gz)”. The “*” indicates the file name. Forward and reverse read files need to have the same file name.
Output Files: These are FASTA files with the extension “*.fasta”. You will find them in the subfolder named “SPAdes_Final_Assemblies” in your current working directory.
Log-File: “SPAdes.log”. You will find this file in your current working directory. Always check this file for potential error messages.
Storage: SPAdes needs free disk space for temporary files. When using compressed FASTQ files (.gz ending), plan for 10 times the combined size of both FASTQ files. E.g. if you have 2 x 200 MB FASTQ files, allow for 4 GB of free disk space. For uncompressed FASTQ files, plan for 2 times the combined size. E.g. if you have 2 x 1000 MB FASTQ files, allow for about 4 GB of free disk space. This is very important: if you do not provide enough storage, your assembly will be canceled at some point. You can increase the volume size of your instance by selecting the appropriate volume in your AWS console (under Volumes). Note that you cannot decrease it afterwards.
Recommended instance type: At least m5.4xlarge/m4.4xlarge. When your instance is stopped, you can change its instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request the use of faster instances from AWS support.
Timeframe: Approx. 10 – 20 minutes per assembly (depending on the AWS instance and the sequencing coverage).
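The storage rule of thumb from the checklist above can be turned into a quick check before submitting a job. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the factors 10 and 2 are taken directly from the checklist, and the estimate is meant for the two FASTQ files of one sample.

```python
import shutil
from pathlib import Path

COMPRESSED_FACTOR = 10    # *.fastq.gz: plan for 10 times the file size
UNCOMPRESSED_FACTOR = 2   # *.fastq: plan for 2 times the file size


def required_bytes(fastq_sizes):
    """Estimate the free disk space needed for one read pair.

    fastq_sizes maps each FASTQ file name to its size in bytes.
    """
    total = 0
    for name, size in fastq_sizes.items():
        factor = COMPRESSED_FACTOR if name.endswith(".gz") else UNCOMPRESSED_FACTOR
        total += factor * size
    return total


def has_enough_space(forward, reverse):
    """Compare the estimate with the free space on the volume holding the reads."""
    sizes = {str(p): Path(p).stat().st_size for p in (forward, reverse)}
    free = shutil.disk_usage(Path(forward).resolve().parent).free
    return free >= required_bytes(sizes)
```

For 2 x 200 MB compressed files, required_bytes returns 4,000,000,000 bytes (4 GB), which matches the example in the checklist.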