Tutorial: How to Improve Your Sequences Using a Read Trimming GUI
Tutorial video on how to trim your Illumina sequence reads using the Trimmomatic software suite with a GUI
Improve sequence quality using the Read Trimming GUI with the DocMind Analyst Software Suite
After sequencing on an Illumina sequencer, you are provided with files that contain short sequence reads, usually about 100 – 300 base pairs. When these sequences are derived from one organism, one of your aims might be to assemble these sequences into the whole genome of the sequenced organism. This can easily and rapidly be done using the Read Trimming GUI module in the DocMind Analyst (Home -> Whole Genome Seq Short Reads -> Assembly) that comes with a full graphical user interface for the required bioinformatics tools.
Be aware that sequence reads that come directly from the sequencer may contain erroneous, low-quality bases or sequencing adapters. It is highly recommended to remove both, since they can heavily disturb your genome assembly. For this reason, the Read Trimming GUI module starts with a quality trimming step using Trimmomatic. It subsequently reports the quality of the trimmed reads using FastQC. You can find a checklist at the end of this tutorial.
Input File Requirements
If you already have high quality read files, you may skip this step by unchecking the Quality Trimming checkbox in the “Pipeline Options” panel on the right side. If not, you should perform this step. The pipeline works with paired-end reads. That means that you have two FASTQ files per sample, one containing all forward (R1) reads and the other containing all reverse (R2) reads. FASTQ files contain sequences and information about the quality of these sequences at each position. These files must follow a certain naming structure to work in the module. They must end in “.fastq” (uncompressed files) or “.fastq.gz” (compressed files). Let’s say your sample name is “ID1”: please name the file with the forward reads ID1_1.fastq (or ID1_1.fastq.gz) and the file with the reverse reads ID1_2.fastq (or ID1_2.fastq.gz). The prefix of both files of one sample (here “ID1”) must be identical, so that the DocMind Analyst recognizes that both files belong to the same sample. If your files do not have this structure, you can easily rename them in a bulk file renamer module.
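Before you copy files into the working directory, it can help to check the naming convention described above yourself. The following is a minimal sketch, not the module's own code: the function name pair_fastq_files and the regular expression are assumptions made for illustration, based only on the “prefix_1.fastq(.gz)” and “prefix_2.fastq(.gz)” rule stated here.

```python
import re
from collections import defaultdict

# Matches names such as ID1_1.fastq or ID1_2.fastq.gz and captures
# the sample prefix ("ID1") and the read number ("1" or "2").
FASTQ_NAME = re.compile(r"^(?P<prefix>.+)_(?P<read>[12])\.fastq(?:\.gz)?$")


def pair_fastq_files(filenames):
    """Group file names by sample prefix.

    Returns (pairs, problems): pairs maps each prefix to its
    (forward, reverse) file names; problems lists every name that
    does not follow the convention or has no partner file.
    """
    by_sample = defaultdict(dict)
    problems = []
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            problems.append(name)
            continue
        by_sample[match["prefix"]][match["read"]] = name

    pairs = {}
    for prefix, reads in sorted(by_sample.items()):
        if "1" in reads and "2" in reads:
            pairs[prefix] = (reads["1"], reads["2"])
        else:
            problems.extend(reads.values())
    return pairs, sorted(problems)
```

Calling it with the names in a directory (for example from os.listdir) gives you the complete pairs and a list of files that would need renaming first.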
Copy all FASTQ files into one directory and choose it as your current working directory under system options. There are two further options that deserve your attention. First, you can choose the number of processors to use for this analysis. The more processors you use, the faster your analysis will run. However, never enter more processors than are available on your instance, as this could severely slow down your analysis. If you are not sure, press the “Auto” button and it will automatically choose the number of processors that are available on your instance. For the Read Trimming GUI module, an instance with 16 processors is recommended (e.g. m5.4xlarge or c5.4xlarge with at least 32 GB RAM).
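If you want to see how many processors your instance reports before entering a number, a short Python check is enough. This is only a sketch of the idea behind the rule above; how the “Auto” button works internally is not documented here, and the function name choose_threads is a hypothetical helper.

```python
import os


def choose_threads(requested=None):
    """Return a thread count that never exceeds the available processors.

    With requested=None the full number of available processors is used,
    which mirrors the idea of an automatic setting (an assumption, not
    the module's actual logic).
    """
    available = os.cpu_count() or 1  # cpu_count() can return None
    if requested is None:
        return available
    return max(1, min(requested, available))
```

For example, choose_threads(64) on a 16-processor instance would return 16 instead of oversubscribing the machine.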
The other option is the “Stop instance after job” checkbox. It is not checked by default. When you check it, your instance will stop 1 minute after the job has finished. This is convenient when a job takes some time and you want to do something else in the meantime. It can save you money, since otherwise your AWS instance would stay active without doing any work. But be cautious: if you have other jobs running on the same instance, they will be interrupted regardless of their status. So it’s best to use this option with only one pipeline job running.
In the next step, choose the criteria for the quality trimming of the sequence reads. As a beginner, it is recommended that you just look at the basic options and keep the default settings of the advanced options. The basic options for quality trimming with Trimmomatic are very straightforward. You simply need to decide on the minimum read length. This matters because the tool removes badly sequenced bases from both ends of the read. This shortens the reads, sometimes significantly. Very short reads, even those of good quality, could disturb the assembly process. For this reason it is a good idea to discard sequence reads once they become too short after trimming. You can set a minimum length. The default of 100 base pairs is a good starting point, but your choice might differ according to your objectives.
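The effect of the minimum read length can be pictured with a few lines of Python. This is an illustration of the rule only, not Trimmomatic's implementation; the function name filter_by_min_length is hypothetical, and the default of 100 is taken from the text above.

```python
def filter_by_min_length(reads, min_length=100):
    """Return only the trimmed reads that are at least min_length long."""
    return [read for read in reads if len(read) >= min_length]


# Three trimmed reads of length 150, 99 and 100 (placeholder sequences).
trimmed = ["A" * 150, "C" * 99, "G" * 100]
kept = filter_by_min_length(trimmed)
print([len(read) for read in kept])
```

Raising min_length discards more reads, so the setting is a trade-off between read count and read length.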
If you are already more familiar with this analysis, it is a good idea to take a look at the right side of the assembly pipeline panel. There you can find the advanced options for each software tool that is used in the assembly pipeline. Let’s start with the advanced Trimmomatic options. In the “Leading” option, you can set the quality threshold for cutting bases off the start of a read. The default of “3” means that a base at the start of the read is removed as long as its quality is below 3. The “Trailing” option works exactly like the “Leading” option, but it cuts bases from the end of a read.
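To make the “Leading” and “Trailing” thresholds concrete, here is a small sketch that works on a list of per-base quality scores. It is a simplified illustration under two assumptions: the quality string uses the Phred+33 encoding common for current Illumina data, and the helper names decode_phred33 and trim_leading_trailing are invented for this example.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer scores (assumes Phred+33)."""
    return [ord(char) - 33 for char in quality_string]


def trim_leading_trailing(qualities, leading=3, trailing=3):
    """Return (start, end) so that qualities[start:end] is the kept part.

    Bases are removed from the start while their quality is below
    `leading`, and from the end while their quality is below `trailing`.
    Low-quality bases in the middle of the read are not touched here.
    """
    start = 0
    end = len(qualities)
    while start < end and qualities[start] < leading:
        start += 1
    while end > start and qualities[end - 1] < trailing:
        end -= 1
    return start, end


qualities = [2, 2, 30, 35, 2, 36, 2]
print(trim_leading_trailing(qualities))  # (2, 6)
```

Note that the poor base in the middle (index 4) survives; handling such bases is the job of the sliding window step.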
Trimmomatic also performs a sliding window approach. It considers multiple bases at once and calculates the average quality score of that “window”. It only removes bases when the average quality of the window drops below a certain threshold. Such an approach has the advantage that single poor bases will not result in the removal of high quality data. You can specify the length of that window in the “Sliding Window Size” option, while the quality threshold can be set in the “Sliding Window Quality” option. In the beginning it is best to use the default values. Later on, you may try different settings. Increasing the average quality requirement results in the removal of more bases, while increasing the window size further reduces the influence of individual bad bases in the window.
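The sliding window idea can be sketched as follows. This is a simplified model: it cuts the read at the start of the first window whose mean quality falls below the threshold, whereas Trimmomatic's exact cut position may differ. The function name sliding_window_cut and the example values (window sizes 4 and 1, thresholds 15 and 24) are illustrative assumptions, not the module's defaults.

```python
def sliding_window_cut(qualities, window_size, min_mean_quality):
    """Return the number of bases to keep from the start of the read.

    The window moves from the start towards the end; the read is cut at
    the first window whose mean quality is below min_mean_quality.
    Reads shorter than the window are left unchanged in this sketch.
    """
    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < min_mean_quality:
            return start
    return len(qualities)


qualities = [30, 30, 2, 30, 30, 30, 10, 8, 6, 4]
print(sliding_window_cut(qualities, 4, 15))  # 5: the single poor base is tolerated
print(sliding_window_cut(qualities, 1, 15))  # 2: a one-base window cuts at the poor base
print(sliding_window_cut(qualities, 4, 24))  # 0: a stricter threshold removes more
```

The three calls show both effects described above: a larger window dilutes a single bad base, and a higher quality requirement removes more of the read.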
When you decide to perform only the quality trimming by unchecking all other parts of the pipeline in the “Pipeline Options” panel, you can proceed by agreeing to the Terms of Service and submitting the job with the “Submit Job” button. Once started, you will see that new files are generated in the folder with your reads. The Read Trimming GUI module is designed so that all read pairs in your current working directory are processed, so you don’t need to do anything but wait until it is finished. You can check the status of your job in the monitor panel (Home -> Monitor) by pressing the “Refresh” button. An “R” stands for a running job, while “C” means that the job has been completed. In your working folder, you will find a file called “Assembly_parameter.txt”. This file documents all your settings for that run.
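The GUI hides the underlying Trimmomatic call. As a rough orientation for readers who know the command line, the sketch below assembles a paired-end Trimmomatic command from the settings discussed above. The step names LEADING, TRAILING, SLIDINGWINDOW and MINLEN and the PE mode with four output files are standard Trimmomatic syntax; everything else is an assumption for illustration: the jar file name, the names of the unpaired output files, the example window values, and the function build_trimmomatic_command itself. The exact command the module runs is not documented here, and adapter clipping is left out because it needs an adapter file.

```python
def build_trimmomatic_command(prefix, threads, leading, trailing,
                              window_size, window_quality, min_length,
                              jar="trimmomatic.jar", compressed=False):
    """Build an argument list for a paired-end Trimmomatic run.

    The list is only constructed, not executed. `jar` and the
    "_unpaired_" file names are hypothetical placeholders.
    """
    ext = ".fastq.gz" if compressed else ".fastq"
    return [
        "java", "-jar", jar, "PE",
        "-threads", str(threads),
        f"{prefix}_1{ext}", f"{prefix}_2{ext}",                 # input pair
        f"{prefix}_HQ_1{ext}", f"{prefix}_unpaired_1{ext}",     # forward outputs
        f"{prefix}_HQ_2{ext}", f"{prefix}_unpaired_2{ext}",     # reverse outputs
        f"LEADING:{leading}",
        f"TRAILING:{trailing}",
        f"SLIDINGWINDOW:{window_size}:{window_quality}",
        f"MINLEN:{min_length}",
    ]


command = build_trimmomatic_command(
    "ID1", threads=16, leading=3, trailing=3,
    window_size=4, window_quality=15, min_length=100,
)
print(" ".join(command))
```

Building the command as a list keeps each argument separate, which is the form expected by subprocess.run if you ever want to execute it yourself.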
Module Output Files
Once finished, you will get two output files. They start with the sample name and end with the same extension as before, but a “_HQ” tag is included to indicate that these are high quality reads. Following the above example, the files would be named “ID1_HQ_1.fastq(.gz)” and “ID1_HQ_2.fastq(.gz)”. These files contain only paired reads. This means that if one read of a pair was discarded, the other was discarded too, even if its quality was appropriate. This is convenient, since many downstream tools have problems when unpaired sequences turn up in your FASTQ files. Trimmomatic will also trim Illumina adapters if they are still present in the FASTQ files. If you have decided to run further parts of the WGS assembly pipeline, the next step is to measure the quality of your reads after the trimming process using FastQC.
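The output naming and the “paired reads only” rule can be summarized in a short sketch. Both helpers are hypothetical and only mirror what is described above: hq_output_names derives the expected file names, and keep_surviving_pairs shows that a pair is kept only if both mates pass the minimum length.

```python
def hq_output_names(prefix, compressed=False):
    """Return the expected names of the two high quality output files."""
    ext = ".fastq.gz" if compressed else ".fastq"
    return f"{prefix}_HQ_1{ext}", f"{prefix}_HQ_2{ext}"


def keep_surviving_pairs(forward_reads, reverse_reads, min_length=100):
    """Keep a read pair only if BOTH trimmed mates reach min_length.

    The two input lists must be in the same order, so that position i
    in each list belongs to the same pair.
    """
    kept = [
        (fwd, rev)
        for fwd, rev in zip(forward_reads, reverse_reads)
        if len(fwd) >= min_length and len(rev) >= min_length
    ]
    return [fwd for fwd, _ in kept], [rev for _, rev in kept]


print(hq_output_names("ID1"))  # ('ID1_HQ_1.fastq', 'ID1_HQ_2.fastq')
```

Because pairs are dropped together, both output files always contain the same number of reads in the same order.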
Checklist for Trimmomatic
Input Files: FASTQ files, either uncompressed (“*.fastq”) or compressed (“*.fastq.gz”). Forward reads are indicated by “*_1.fastq(.gz)”, reverse reads by “*_2.fastq(.gz)”. The “*” stands for the sample prefix. The forward and reverse read files of one sample need to have the same prefix.
Output Files: High quality read files named “*_HQ_1.fastq(.gz)” and “*_HQ_2.fastq(.gz)”. You will find them in your current working directory next to the untrimmed files.
Log-File: “trimmomatic.log”. You will find this file in your current working directory. Always check this file for potential error messages.
Storage: At least twice as much as you need to store your FASTQ files. For example, if you have 10 GB of FASTQ files, you will need at least 20 GB of storage. You can change the volume size of your instance by selecting the appropriate volume in your AWS console (under Volumes) and increasing it. Note that you cannot decrease it.
Recommended instance type: At least m5.2xlarge/m4.2xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request usage of faster instances from the AWS support.
Timeframe: Approx. 5-10 minutes per FASTQ pair (depending on the AWS instance and the sequencing coverage).
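The storage rule from the checklist is easy to check before you start a run. The sketch below sums the sizes of all FASTQ files in a directory and applies the factor of two mentioned above; the function names fastq_sizes and required_storage_bytes are hypothetical helpers, not part of the DocMind Analyst.

```python
import os


def fastq_sizes(directory):
    """Return the sizes in bytes of all *.fastq and *.fastq.gz files."""
    return [
        entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.endswith((".fastq", ".fastq.gz"))
    ]


def required_storage_bytes(file_sizes, factor=2):
    """Apply the checklist rule: total storage is `factor` times the input size."""
    return factor * sum(file_sizes)
```

For a working directory, required_storage_bytes(fastq_sizes(path)) gives the number of bytes to compare against the size of your volume.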
Sequence reads quality control with FastQC
After read trimming, it is recommended to check the trimmed reads in order to evaluate whether the reads are of sufficient quality, particularly in case the following assembly does not work appropriately. For this, you just need to provide sequence reads with the “_HQ” tag (e.g. ID1_HQ_1.fastq(.gz)) in the current working directory. Check the Quality Control FastQC checkbox. In case you are running the whole assembly pipeline you do not need to provide any files since these files will be produced in the previous step of the pipeline (trimming).
Output files will be HTML files with the same prefix as your FASTQ files. You can open them with any web browser (e.g. Firefox, Chrome). A lot of parameters are evaluated during this step, e.g. base quality, GC content distribution, clustering. Please check the FastQC documentation for more information and background. The respective log file is called “fastqc.log” and you will find it in the current working directory. Always check it for error messages. After quality control, you have three options to assemble the short sequence reads into a whole genome (SPAdes, Skesa, A5).
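Checking the log files by eye works, but with many samples a quick scan can help. The following sketch looks for suspicious lines in “trimmomatic.log” and “fastqc.log”. The keyword list is an assumption, since the exact wording of error messages in these logs is not described here, and the function names find_error_lines and check_logs are hypothetical.

```python
from pathlib import Path


def find_error_lines(log_text, keywords=("error", "exception", "failed")):
    """Return (line_number, line) for every line containing a keyword.

    The keywords are a guess at typical error wording; adjust them to
    what you actually see in your logs.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.strip()))
    return hits


def check_logs(directory, names=("trimmomatic.log", "fastqc.log")):
    """Scan the log files that exist in `directory`; missing logs are skipped."""
    report = {}
    for name in names:
        path = Path(directory) / name
        if path.is_file():
            report[name] = find_error_lines(path.read_text(errors="replace"))
    return report
```

An empty list for a log means no keyword was found, which is not proof of success, so it is still worth opening the file when a result looks odd.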