Tutorial: Taxonomic Classification Using RDP with a Graphical User Interface (GUI)
Tutorial video on how to classify your 16S rRNA sequence reads using RDP within the DocMind Analyst graphical user interface (GUI)
Microbiome Classification with an RDP GUI within the DocMind Analyst Software Suite
Within this module, each of the high-quality 16S rRNA sequence reads is classified to a taxonomic rank using the Ribosomal Database Project (RDP) classifier. RDP is a classifier trained on known type strain 16S sequences of 880 genera. When one of your sequence reads is analyzed, it is considered to be a member of the genus with the highest probability. This analysis is repeated 100 times (bootstrapping), and the number of times a certain genus comes out as the most likely one gives the confidence in the assignment to that genus. Generally, sequence reads are assigned at the following taxonomic ranks: domain, phylum, class, order, family, and genus.
As you are probably aware, the chance of a correct assignment is lower for specific ranks like genus and family, and higher for less specific ranks like phylum. RDP does not classify to species level. This might be considered a disadvantage, but it is not, since species-level classifications based on 16S rRNA reads are often false and would lead to wrong conclusions in your analysis. Even if you feed the high-quality reads produced by this pipeline into another classifier, always be aware of this general limitation. Species classification works better on shotgun metagenomics data, but you still need to be quite careful as long as you use short reads.
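The bootstrap vote counting described above can be sketched in a few lines of Python. This is only an illustration of how a confidence value follows from 100 repeated trials, not the RDP classifier's actual implementation; the function name and the genus names in the example are assumptions made for this sketch.

```python
from collections import Counter

def bootstrap_confidence(top_genus_per_trial):
    """Return (genus, confidence) from the winning genus of each bootstrap trial.

    The confidence is the fraction of trials in which the most frequent
    genus was the top hit.
    """
    counts = Counter(top_genus_per_trial)
    genus, hits = counts.most_common(1)[0]
    return genus, hits / len(top_genus_per_trial)

# Hypothetical outcome of 100 bootstrap trials for a single read.
trials = ["Bacteroides"] * 87 + ["Prevotella"] * 13
genus, confidence = bootstrap_confidence(trials)
print(genus, confidence)  # Bacteroides 0.87
```

In this made-up example, the read would be reported as Bacteroides with a confidence of 0.87.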
Input File Requirements
The RDP classifier is very easy to use. You have to provide a subfolder called “HQ_reads” in your current working directory. Here, you need to place the high-quality sequence reads of at least 50 base pairs in length, with one FASTA file per sample ending in “_HQ.fasta”. The prefix is usually your sample name, e.g. ID1. This would make up the filename “ID1_HQ.fasta”. You don’t need to worry about this if you are using the whole pipeline, as is recommended. In this case everything is generated automatically for you. Finally, you need to check the “RDP Analysis” box in the “Pipeline Options” panel.
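If you prepare the input folder by hand, a quick sanity check can help before starting the module. The following Python sketch is not part of the pipeline; the function names are made up for illustration. It only assumes the folder name, file ending, and minimum read length stated above, and reports for each sample how many reads are shorter than 50 bases.

```python
from pathlib import Path

MIN_LENGTH = 50          # minimum read length stated in the tutorial
SUFFIX = "_HQ.fasta"     # required file ending

def read_lengths(fasta_path):
    """Yield the length of each sequence in a FASTA file."""
    length = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
    if length is not None:
        yield length

def check_hq_reads(workdir="."):
    """Map each sample name to its number of reads shorter than MIN_LENGTH."""
    hq_dir = Path(workdir) / "HQ_reads"
    if not hq_dir.is_dir():
        return {}
    report = {}
    for path in sorted(hq_dir.glob("*" + SUFFIX)):
        sample = path.name[: -len(SUFFIX)]
        report[sample] = sum(1 for n in read_lengths(path) if n < MIN_LENGTH)
    return report

if __name__ == "__main__":
    print(check_hq_reads("."))
```

An empty result means that no matching files were found; a non-zero count for a sample means that some of its reads are too short.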
There are two basic options you need to consider. You find them in the “16S Classifier Options” panel. First, let’s talk about copy number adjustment. Taxonomic analysis means that you count the number of reads assigned to a specific taxon. For instance, you have 200 reads assigned to genus A and 150 reads assigned to genus B. You might conclude that genus A is more abundant in your dataset than genus B. This could be wrong, since different bacterial organisms can have different numbers of 16S rRNA copies per cell. If genus A has twice as many 16S rRNA copies in its cells, it appears to be more abundant than genus B. In fact, there were fewer genus A cells in the sample; they just had more 16S rRNA copies that were sequenced. The RDP classifier gives you the possibility to adjust for that by weighting each sequence read by (1 / mean 16S gene copy number). The information about the mean copy number is obtained from the rrnDB database. If there is no copy number available for a taxon, the mean copy number of its parent is used. And this is the problem: if you adjust, you might just introduce another bias. It’s a hard decision to make, and for that reason copy number adjustment is not enabled by default in this pipeline.
Apart from copy number differences, when you perform a comparative analysis, meaning you compare the taxonomic profile of one sample with another, you need to determine a relative abundance. Here, the total number of reads in your analysis plays a huge role. Having 200 reads assigned to taxon A in sample 1 and 400 reads assigned to taxon A in sample 2 does not automatically mean that taxon A is more abundant in sample 2. If sample 2 contains twice as many reads as sample 1, you have a very good chance of assigning twice as many reads to a certain taxon in sample 2, even though the actual abundance of taxon A is the same in both samples. In our opinion, normalizing by the total read number is a much better correction for a comparative analysis than the copy number adjustment. We will come back to this when the output files are described.
The second important parameter is the RDP confidence score. As described above, for each taxonomic assignment at every rank, the RDP classifier estimates the reliability of the classification using bootstrapping. Here, you can set a confidence threshold. The default is 0.8, meaning 80%. If a sequence read cannot be assigned with an RDP confidence score above this 80% threshold, it is displayed under the “unclassified taxon”. With this parameter, you can decide how stringent your assignment is. For reads shorter than 250 base pairs, an RDP confidence score of 0.5 (50%) was shown to work better for genus level assignments. Generally, the choice also depends on the variable region you have sequenced. For guidance, it is highly recommended to read the “Confidence Threshold” chapter on the RDP website. To reach this website directly from the DM-Analyst, you can press the “Details” button in the “16S Classifier Options” panel.
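The weighting by (1 / mean 16S gene copy number) can be illustrated with the numbers from the example above. This is a minimal sketch and not the RDP classifier's code; the copy numbers are hypothetical values chosen so that genus A has twice as many copies as genus B, and the function name is made up for this illustration.

```python
def copy_number_adjust(read_counts, mean_copy_number):
    """Weight each read by 1 / mean 16S copy number of its taxon."""
    return {
        taxon: count / mean_copy_number[taxon]
        for taxon, count in read_counts.items()
    }

reads = {"genus A": 200, "genus B": 150}
copies = {"genus A": 4.0, "genus B": 2.0}  # hypothetical mean copy numbers

adjusted = copy_number_adjust(reads, copies)
print(adjusted)  # {'genus A': 50.0, 'genus B': 75.0}
```

After the adjustment, genus B comes out as more abundant than genus A, which reverses the impression given by the raw read counts.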
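The effect of the threshold can be sketched as follows. This is a simplified illustration, not RDP's implementation: it assumes that a rank is accepted when its confidence is at least the threshold and that every rank below the first failing one is reported as unclassified. The confidence values in the example are made up.

```python
def apply_threshold(assignment, threshold=0.8):
    """Replace ranks that fail the confidence threshold with 'unclassified'.

    assignment: list of (rank, taxon, confidence), ordered from domain to genus.
    """
    result = {}
    passed = True
    for rank, taxon, confidence in assignment:
        # Once one rank fails, all more specific ranks fail as well.
        passed = passed and confidence >= threshold
        result[rank] = taxon if passed else "unclassified"
    return result

# Hypothetical classification of one read.
read = [
    ("domain", "Bacteria", 1.00),
    ("phylum", "Firmicutes", 0.98),
    ("class", "Clostridia", 0.95),
    ("order", "Clostridiales", 0.93),
    ("family", "Lachnospiraceae", 0.85),
    ("genus", "Roseburia", 0.62),
]

print(apply_threshold(read, 0.8)["genus"])  # unclassified
print(apply_threshold(read, 0.5)["genus"])  # Roseburia
```

With the default of 0.8, this read is counted down to family level only; lowering the threshold to 0.5 also keeps the genus assignment.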
Module Output Files
There are a large number of output files that you can find after the analysis. The most important ones can be found in a subfolder of your current working directory called “Results_sample_set”. In that folder, you find the following files: “Taxa_genus_merged_final.csv”, “Taxa_family_merged_final.csv”, “Taxa_order_merged_final.csv”, “Taxa_class_merged_final.csv”, and “Taxa_phylum_merged_final.csv”. These files are probably what you want most. For each sample, you will find the relative abundance for each taxonomic unit for the specified rank (e.g. phylum). This is what you need to compare each sample to the others. You can easily load a CSV file into Excel and make further analyses and graphs there.
You need to be aware that you are not looking at the absolute abundance, i.e. the number of reads assigned to a taxon (copy number adjusted or not). Regardless of whether copy number adjustment was applied, you are looking at a relative abundance. Here, for any taxon, the assigned reads are divided by the total number of reads in the sample, and the ratio is multiplied by 1 million. This leads to the unit “hits per million input reads (hpm)”. In other words, you see the number of reads that would be assigned to that taxon if your sample had 1 million sequence reads. This is a common way to normalize data and make a fair comparison between samples. If you are interested in the absolute abundance, you can have a look at the RDP output files for each sample. They end with “_hier.txt” and start with “cnadjusted_” in case you performed copy number adjustment. You find them in your current working directory.
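The hpm normalization is simple arithmetic and can be sketched as shown here. The example reuses the 200 and 400 reads assigned to taxon A from the comparison discussed earlier; the total read numbers per sample are hypothetical, and the function name is made up for this illustration.

```python
def hits_per_million(assigned_reads, total_reads):
    """Assigned reads scaled to a sample of 1 million input reads (hpm)."""
    return assigned_reads * 1_000_000 / total_reads

# Sample 2 has twice as many reads in total and twice as many hits for taxon A.
sample1_hpm = hits_per_million(200, 10_000)
sample2_hpm = hits_per_million(400, 20_000)

print(sample1_hpm, sample2_hpm)  # 20000.0 20000.0
```

Both samples end up with the same hpm value, which shows that the higher raw count in sample 2 only reflects its higher sequencing depth.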
In case you are interested in the diversity of your samples, you should look at the “All_Diversity_Stats.tsv2” file that you will also find in the “Results_sample_set” folder. For each sample and each taxonomic rank, you will find the following information: richness, Shannon diversity, Simpson diversity, and evenness. You can also use this file for comparative analysis. For instance, import it into Excel and run your subsequent statistical tests.
But what do these values tell you? Let’s start with richness. This is simply the number of different taxonomic units in a rank. For instance, if you find the number 124 for genus richness, this means that 124 different genera were detected in your dataset. The more different taxa are present in your dataset, the more diverse the observed microbiome is.
Diversity has received some attention in the last couple of years because a low diversity in the intestinal microbiome is associated with some diseases. Two different diversity indices are calculated: the Shannon index and the Simpson index. You can use both. Generally, the higher the value, the higher the diversity. The final value given is the evenness. It simply tells you how similar the abundances of the different taxa in your sample are. So if you have the following abundances of genera in an ecosystem (genus A = 100, genus B = 105, genus C = 95), the evenness is high since all genera have almost the same abundance and none is dominating. Low values of evenness indicate the presence of at least one dominant taxon (for instance genus A = 100, genus B = 1050, genus C = 95). In that way, evenness is also a measure of biodiversity.
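To get a feeling for these values, the following Python sketch computes them for the two example communities above. The exact formulas used by the pipeline are not stated here, so the sketch relies on common textbook definitions as an assumption: Shannon index H = -sum(p * ln p), Simpson index in the form 1 - sum(p^2) so that higher values mean higher diversity, and Pielou's evenness H / ln(richness).

```python
import math

def diversity_stats(counts):
    """Richness, Shannon, Simpson (1 - sum p^2) and Pielou evenness.

    These are textbook definitions; the pipeline's own formulas may differ.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    proportions = [c / total for c in counts]
    richness = len(counts)
    shannon = -sum(p * math.log(p) for p in proportions)
    simpson = 1 - sum(p * p for p in proportions)
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return {
        "richness": richness,
        "shannon": shannon,
        "simpson": simpson,
        "evenness": evenness,
    }

even = diversity_stats([100, 105, 95])      # no genus dominates
skewed = diversity_stats([100, 1050, 95])   # genus B dominates

for name, stats in (("even", even), ("skewed", skewed)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Both communities have a richness of 3, but the evenness of the first one is close to 1, while the second one drops to roughly half of that because genus B dominates.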
Finally, you will also find a folder called “Results_samples” in your current working directory. Here, you can find a great number of additional output files for each sample.
Checklist for RDP Classifier
Input Files: Sequence reads in FASTA format with the ending “_HQ.fasta” stored in a folder called “HQ_reads”. The folder needs to be located in the current working directory. FASTA files should be free of chimeric sequences.
Output Files: The main output files can be found in the folder “Results_sample_set” in your current working directory. For taxonomic classification, these are CSV files starting with “Taxa”. The assigned taxonomic level is given in the second part of the file name. Diversity data can be found in the file “All_Diversity_Stats.tsv2”. You can open all files with LibreOffice Calc.
Log-Files: “rdp_pipe.log” and “rdp_analyzer.log”. You will find both files in your current working directory. Always check these files for potential error messages.
Storage: Allow for a few GB of free disk space. RDP does not need much, and the output files are fairly small.
Recommended instance type: At least m5.2xlarge/m4.2xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request usage of faster instances from the AWS support.
Timeframe: Approx. 2 – 5 minutes per sample (depending on the AWS instance and the sequencing coverage).