Tutorial on How to Run a 16S rRNA Microbiome Classification Pipeline in the AWS Cloud with a Graphical User Interface (GUI)
Tutorial video on how to determine a 16S rRNA microbiome profile using a graphical user interface (GUI) with the DocMind Analyst
Taxonomic Classification with a 16S rRNA Microbiome Pipeline GUI using the DocMind Analyst Software Suite
After amplification and sequencing of the 16S rRNA region of your metagenomic sample, you will most likely want to reconstruct the taxonomic composition of your sample. This is not an easy task. While 16S sequencing is inexpensive and thus still a popular methodology for studies with a high sample size, you need to be aware of its limitations:
- You will introduce a PCR bias that is likely to skew the apparent taxonomic composition.
- You will only be able to reconstruct your sample’s bacterial taxonomy, not its functional composition (e.g. gene functions, since you have only sequenced the 16S rRNA gene) or other microorganisms such as parasites, fungi, and viruses.
- Taxonomic composition will depend on your choice of primers. This is a critical point: you can compare samples sequenced with the same primers, but never samples sequenced with different primers.
- With short reads, alignment to a 16S database might not be specific enough. In theory, you can merge overlapping reads, but this is usually error prone, especially after read trimming and with short overlaps. It is highly recommended that your reads be at least 250–300 bp long (e.g. MiSeq sequencing). Even then, all classifiers will make errors when classifying short-read data.
- About 1–5 % of your sequence reads can be chimeric. Such chimeras are formed from two or even more biological sequences that act as templates during PCR. They are rare in shotgun sequencing but common in amplicon sequencing, where closely related marker sequences are amplified. It is believed that most chimeras arise from incomplete extension: a partially extended strand can anneal to a template from a different but similar sequence, where it acts as a primer and is further extended into a chimeric sequence.
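If you want to check that your reads meet the recommended 250–300 bp length before submitting a job, a small shell sketch like the following can help. The tiny dummy FASTQ file below only stands in for one of your real R1/R2 files:

```shell
# Create a tiny dummy FASTQ file (stand-in for a real R1/R2 file).
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' > demo.fastq

# Every 4-line FASTQ record holds the sequence on its second line;
# sum the sequence lengths and report the mean read length.
awk 'NR % 4 == 2 { total += length($0); n++ }
     END { printf "reads=%d mean_length=%.1f\n", n, total/n }' demo.fastq
```

For real MiSeq data you would run the awk command on your own (decompressed) FASTQ files; a mean length well below 250 bp is a warning sign.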
Input File Requirements
Our pipeline is designed to address the dry-lab analytical issues, such as chimeric sequences. All you need to do is copy the FASTQ files with your sequence reads into a folder of your choice. This will be your current working directory. Our pipeline works with paired-end reads, meaning you have two FASTQ files per sample: one containing all forward (R1) reads, the other containing the reverse (R2) reads. FASTQ files contain sequences along with quality information for each position. These files must follow a certain naming structure to work in the pipeline: they must have the extension “.fastq” (uncompressed files) or “.fastq.gz” (compressed files).
Let’s say your sample name is “ID1”: please name the file with the forward reads ID1_1.fastq (or ID1_1.fastq.gz) and the file with the reverse reads ID1_2.fastq (or ID1_2.fastq.gz). The prefix of both files of one sample (here “ID1”) must be identical, so that the DocMind Analyst will recognize that both files belong to the same sample. If your files do not have this structure, you can easily rename them in bulk (Renamer Tool Tutorial). After starting the DocMind Analyst, you can start the 16S rRNA Pipeline from the home screen as illustrated below (Home -> 16S rRNA Analysis -> 16S rRNA Pipe).
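The Renamer Tool linked above is the supported way to rename files in bulk. If you are comfortable with the command line, a rough bash alternative could look like the sketch below; the Illumina-style input names (e.g. ID1_S1_L001_R1_001.fastq.gz) are only an assumed example, so adapt the pattern to your own files:

```shell
# Demo directory with dummy Illumina-style file names (assumed example).
mkdir -p fastq_demo && cd fastq_demo
touch ID1_S1_L001_R1_001.fastq.gz ID1_S1_L001_R2_001.fastq.gz

# Rename each R1/R2 pair to the pipeline convention <sample>_1/_2.fastq.gz.
for f in *_R1_001.fastq.gz; do
  sample=${f%%_S*}                      # sample prefix, e.g. "ID1"
  mv "$f" "${sample}_1.fastq.gz"
  mv "${f/_R1_/_R2_}" "${sample}_2.fastq.gz"
done

ls                                      # ID1_1.fastq.gz  ID1_2.fastq.gz
cd ..
```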
After copying all FASTQ files into one directory, select it as your current working directory under system options. Two further options deserve your attention. First, you can choose the number of processors to use for this analysis. The more processors you use, the faster your analysis runs. However, never enter more processors than are available on your instance; doing so could severely slow down your analysis. If you are not sure, press the “Auto” button and the number of processors available on your instance will be chosen automatically. For the DocMind Analyst 16S rRNA Analysis pipeline, an instance with 16 or more processors is recommended (e.g. m5.4xlarge or c5.4xlarge with at least 32 GB RAM), with at least six times as much free disk space as all your input files occupy.
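The six-times disk-space rule can be checked with a short shell sketch before you submit a job. The 1 MB dummy file below only stands in for your real input; point the commands at your actual working directory:

```shell
# Demo working directory with a 1 MB dummy FASTQ file (stand-in for real input).
mkdir -p disk_demo
head -c 1048576 /dev/zero > disk_demo/ID1_1.fastq

# Total size of the input FASTQ files, in KB (last line of du -c is the total).
input_kb=$(du -ck disk_demo/*.fastq* | awk 'END { print $1 }')
# Recommended minimum free space: six times the input size.
needed_kb=$((input_kb * 6))
# Free space on the file system holding the working directory.
free_kb=$(df -Pk disk_demo | awk 'NR == 2 { print $4 }')

echo "input=${input_kb}KB needed=${needed_kb}KB free=${free_kb}KB"
[ "$free_kb" -ge "$needed_kb" ] && echo "disk space OK" || echo "need more disk space"
```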
The other option is the “Stop instance after job” checkbox, which is not checked by default. When you check it, your instance will stop 1 minute after the job has finished. This is convenient when a job takes some time and you want to do something else in the meantime, and it can save you money, since your AWS instance would otherwise keep running (and incurring charges) while idle. But be cautious: if other jobs are running on the same instance, they will be interrupted regardless of their status, so it is best to use this option only when a single pipeline job is running.
If you want to run the entire pipeline, check all boxes in the “Pipeline Options” panel. In this case, you need to set the options for all modules. Usually, the default parameters will serve you well; however, you should at least have a look at the Basics options.
The pipeline starts by trimming your reads and thus increasing their quality. The options are explained in great detail in the trimming tutorial. Although that tutorial explains the use of the trimming module in the short-read assembly pipeline, it works exactly the same way in this pipeline. Afterwards, quality is checked using FastQC, which is also described in the trimming tutorial.
Once you have high-quality reads, the next objective is to remove chimeric reads. The corresponding module, which uses the mothur software suite, is explained in the chimera tutorial. Since it runs automatically without any options to set, it is quite easy to use.
Next comes the most important module of this pipeline. Once the reads are free of chimeras, the popular RDP classifier is used to assign each read to a taxonomic unit, if possible. In the best case, a sequence read can be classified down to the species level. If this is not possible, the read is assigned to a less specific taxonomic level such as family, order, or even just phylum. In some cases, a read cannot be classified at all (unknown sequence); it will then appear in the unclassified section. The most important CSV output files for all taxonomic levels can be found in the folder “Results_sample_set”. The RDP classifier tutorial explains all module options and the output format in great detail.
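Once the run has finished, you may want a quick look at the most abundant taxa without opening a spreadsheet. The sketch below is only an illustration: the file name genus_demo.csv and the two-column taxon,count layout are assumptions, so consult the RDP classifier tutorial for the actual column structure of the CSV files in “Results_sample_set”:

```shell
# Dummy genus-level table; the (taxon,count) layout is an assumption.
printf 'taxon,count\nBacteroides,1200\nPrevotella,300\nFaecalibacterium,800\n' > genus_demo.csv

# Skip the header, sort by the count column (descending), show the top taxa.
tail -n +2 genus_demo.csv | sort -t, -k2,2 -nr | head -3
```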
Finally, a hierarchical cluster analysis is performed and the results are displayed as a heatmap. This analysis shows how closely the microbiomes of your samples are related, which can be very useful if you want to perform a comparative analysis: you will see differences among your samples at a glance. You can see an example below. The hierarchical clustering tutorial will help you choose the best-fitting parameters for your dataset.
Once you have set all options, agree to the terms of service and submit the job by pressing the “Submit Job” button. Once started, you will see new files being generated in the folder with your reads (the current working directory). The pipeline processes all read pairs you have provided in your current working directory, so you don’t need to do anything but wait until it finishes.
You can check the status of your job in the monitor panel (Home -> Monitor) by pressing the “Refresh” button. An “R” stands for a running job, while a “C” means that the job has been completed. In your working folder, you will find a file called “16S_Pipe_parameter.txt” in which all your settings for that run are documented. This is important when you write up your manuscript and no longer remember the parameters you chose. However, be aware that if you start another pipeline job, this file will be overwritten with the new parameters.
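Since the parameter file is overwritten by each new run, it is a good idea to copy it to a timestamped name before you start the next job. A minimal sketch (the echo line only creates a dummy stand-in for the real file written by the pipeline):

```shell
# Dummy stand-in for the real parameter file written by the pipeline.
echo "trim_quality=20" > 16S_Pipe_parameter.txt

# Copy it to a timestamped name so the next run cannot overwrite it.
stamp=$(date +%Y%m%d_%H%M%S)
cp 16S_Pipe_parameter.txt "16S_Pipe_parameter_${stamp}.txt"
ls 16S_Pipe_parameter_*.txt
```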