Tutorial: Whole Genome Sequencing (WGS) Assembly Statistics with a Graphical User Interface (GUI)
Tutorial video on how to perform statistics with your whole genome sequence (WGS) assembly
Statistical Analysis of your WGS Assembly using the DocMind Analyst
Once you have generated assembly files, you can calculate the most important statistics by checking the “Assembly Statistics” box in the “Pipeline Options” panel. If you are running the assembly pipeline, you don’t need to worry about input files, since they are set up for you automatically.
Otherwise, you will need to create a folder named either “SPAdes_Final_Assemblies”, “A5_Final_Assemblies”, “Skesa_Final_Assemblies” or “Postprocessed_Assemblies” in your current working directory. You can transfer as many assemblies into it as you wish (endings *.fasta, *.fas, *.fa or *.ffn). After the job is done, you will find the results in the folders mentioned above. For instance, if you run the pipeline with SPAdes, you will find the final assemblies together with the statistics in the “SPAdes_Final_Assemblies” folder. The most important file is named “Assembly_Stats.txt”. In this file, you will find the number of sequences (contigs/scaffolds), the N50 and the GC content of each assembly.
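If you prepare the input folders by hand, a quick check of what the pipeline will find can save a failed run. The following is a minimal sketch, not part of the DocMind Analyst itself: the function name find_assemblies is hypothetical, and matching the file endings exactly (case-sensitive) is an assumption. The folder names and endings are the ones listed above.

```python
from pathlib import Path

# Folder names and file endings as listed in the tutorial text.
ASSEMBLY_FOLDERS = (
    "SPAdes_Final_Assemblies",
    "A5_Final_Assemblies",
    "Skesa_Final_Assemblies",
    "Postprocessed_Assemblies",
)
FASTA_ENDINGS = {".fasta", ".fas", ".fa", ".ffn"}


def find_assemblies(workdir="."):
    """Map each existing assembly folder to its sorted assembly file names.

    Hypothetical helper for illustration; folders that do not exist are skipped.
    """
    found = {}
    for folder in ASSEMBLY_FOLDERS:
        path = Path(workdir) / folder
        if path.is_dir():
            found[folder] = sorted(
                f.name
                for f in path.iterdir()
                if f.is_file() and f.suffix in FASTA_ENDINGS
            )
    return found


if __name__ == "__main__":
    # List what would be picked up from the current working directory.
    for folder, files in find_assemblies().items():
        print(f"{folder}: {len(files)} assembly file(s)")
```

A folder that shows zero assembly files usually means the files carry a different ending than the four listed above.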
N50 as a measure for assembly quality
The number of sequences and the N50 indicate how successful your assembly was. The N50 is defined as the length of the shortest contig among the longest contigs that together cover 50% of the genome length. As an example, imagine you have nine contigs with 2, 3, 4, 5, 6, 7, 8, 9 and 10 base pairs. The length of the genome is the sum of the base pairs of all contigs, in this case 54 base pairs. 50% of the genome length is 27 base pairs. The three longest contigs (10 + 9 + 8 = 27) represent 50% of the genome length. The shortest of them is the N50 (here N50 = 8). In general, a high N50 combined with a low number of contigs points to a good assembly. However, when you compare the N50 of different genomes, they should have the same genome size (number of base pairs); otherwise the comparison is unfair. You also need to consider that long contigs do not necessarily mean a good assembly. The assembly can still be incorrect despite yielding long contigs. These statistics show you how well your assemblies worked in general and whether some of them are probably of inferior quality and need more attention.
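The worked example can be reproduced in a few lines of Python. This is a minimal sketch of the definition given above, not the code the pipeline itself runs; the function name n50 is chosen here for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are added from longest to shortest until they cover at least
    half of the total length; the last contig added is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(sum(contigs))  # 54, the total length used in the example
print(n50(contigs))  # 8, reached after 10 + 9 + 8 = 27
```

Sorting first is what makes the result independent of the order in which contigs appear in the FASTA file.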
GC content for detection of bias and contamination
The GC content is a good measure for showing whether some bias occurred during the production of the sequencing library. For instance, if you have sequenced 100 strains of Pseudomonas aeruginosa, you would expect their genomes to have an average GC content of about 66%. If you see two strains with a GC content lower than 50%, you can suspect that something went wrong, e.g. heavy contamination or a mix-up when picking the samples. All these statistics are also displayed graphically as png files (seq_count.png, n50.png, gc_content.png). You will find these files in the folder with your final assemblies. These graphs make it even easier to spot outliers. For more detailed statistics, you can visit the QUAST website by clicking the respective button in the QUAST panel. There you can upload your assembly and obtain further statistics that help you to estimate its quality.
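To make the idea concrete, here is a minimal sketch of how GC content can be computed and how outliers can be flagged. It is an illustration under assumptions, not the pipeline's own implementation: the function names, the sample names, the choice to count only A, C, G and T, and the tolerance of 5 percentage points are all hypothetical. Only the expected value of about 66% comes from the text above.

```python
def gc_content(sequences):
    """Percent G+C over all contigs of one assembly.

    Assumption: only A, C, G and T are counted, so N and other
    ambiguity codes do not influence the result.
    """
    gc = at = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def flag_gc_outliers(gc_by_sample, expected, tolerance):
    """Return the names of samples whose GC content is off by more than the tolerance."""
    return sorted(
        name for name, gc in gc_by_sample.items()
        if abs(gc - expected) > tolerance
    )


# Hypothetical samples; 66% is the expected value for P. aeruginosa.
samples = {"strain_01": 66.2, "strain_02": 65.8, "strain_03": 48.9}
print(flag_gc_outliers(samples, expected=66.0, tolerance=5.0))  # ['strain_03']
```

A flagged sample is only a hint to look closer; the tolerance that makes sense depends on the species and should be chosen from your own data.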
Multi-Locus Sequence Type (MLST) as Indicator for the Genetic Relatedness of Your Samples
The last two columns of the “Assembly_Stats.txt” file are the “PubMLST scheme name” and “sequence type” columns. MLST stands for multilocus sequence typing. In molecular biology, this is a way of estimating how closely two or more isolates are genetically related to each other. In order to do that, a number of “housekeeping genes” are analyzed. These genes are present in every genome of a species, so you will always find them. Each distinct sequence of one of these genes is a separate allele. The combination of the alleles from all the (usually seven) housekeeping genes, the allelic profile, determines a strain’s sequence type. Strains with the same sequence type have identical housekeeping genes. Hence, they are assumed to be genetically closely related. The mlst tool by Torsten Seemann will extract these housekeeping genes from your assemblies. From the combination of genes that the tool finds, it will determine the PubMLST scheme name. This is essentially the bacterial species from which these genes are derived. It will then assign a sequence type within that scheme, unless no matching sequence type is found in the databases. In this case, your isolate might represent a novel sequence type. However, be aware that schemes are not available for all bacterial species. Here is a list of the scheme names available in the latest version:
Cdiphtheriae, cronobacter, mhyopneumoniae, pmultocida_multihost, csepticum, kpneumoniae, cglabrata, miowae, vparahaemolyticus, senterica, vvulnificus, mcatarrhalis, leptospira_3, aphagocytophilum, fpsychrophilum, shominis, afumigatus, lmonocytogenes, Streptomyces, bintermedia, aeromonas, bbacilliformis, kkingae, cfreundii, Bordetella, hsuis, pdamselae, mhaemolytica, vibrio, sepidermidis, brucella, ganatis, clanienae, pgingivalis, mabscessus, mbovis, bhampsonii, pfluorescens, suberis, mplutonius, Neisseria, paeruginosa, sdysgalactiae, hcinaedi, hpylori, hparasuis, ecoli, scanis, bpilosicoli, campylobacter, calbicans, xfastidiosa, Yersinia, cinsulaenigrae, spyogenes, cmaltaromaticum, pacnes, clari, mcaseolyticus, edwardsiella, cdifficile, bcereus, vtapetis, mpneumoniae, lsalivarius, dnodosus, blicheniformis, spseudintermedius, sagalactiae, spneumoniae, ckrusei, kseptempunctata, bhenselae, mmassiliense, leptospira, arcobacter, otsutsugamushi, chlamydiales, borrelia, ppentosaceus, achromobacter, sbsec, ssuis, brachyspira, chelveticus, Wolbachia, mcanis, csputorum, soralis, koxytoca, bsubtilis, cupsaliensis, bhyodysenteriae, sthermophilus_2, csinensis, pmultocida_rirdc, efaecium, tenacibaculum, leptospira_2, bcc, cbotulinum, ecloacae, vcholerae, yruckeri, szooepidemicus, tvaginalis, chyointestinalis, smaltophilia, ypseudotuberculosis, slugdunensis, ranatipestifer, magalactiae, ecoli_2, cconcisus, abaumannii_2, ctropicalis, plarvae, cfetus, hinfluenzae, rhodococcus, kaerogenes, bpseudomallei, shaemolyticus, efaecalis, saureus, msynoviae, orhinotracheale, abaumannii, sinorhizobium, taylorella, mhyorhinis, sgallolyticus, sthermophilus
In case you are not sure which species is behind one of these names, have a look at the PubMLST website.
Please note that the assignment to a PubMLST scheme name is not a valid species assignment. On the other hand, when you have an expectation of which species you have sequenced, and this expectation does not match the reported scheme name, there might be something wrong. In this situation it makes sense to check the GC content as mentioned above in order to rule out a sample contamination. In that sense, only make use of the MLST sequence type if your sequenced species matches the scheme name. If there is no scheme for the sequenced species, MLST sequence types are of no use anyway.
Let’s assume that you get the scheme name of your expected species and the mlst tool provides you with a sequence type. The sequence type will give you an idea whether your strains are genetically close or distinct. However, the resolution is quite low, since MLST does not weigh whether two alleles differ by a single nucleotide substitution or by multiple SNPs. Tools like eBURST take allelic differences into account, infer phylogenetic relations between different isolates and visualize them in a convenient way. But this is only reliable for bacterial species with low recombination rates. Of course, since you have the full sequence of the strains, you can take a deeper look into the phylogenetics, for instance with our Core Genome Phylogeny pipeline, which uses maximum likelihood approaches to infer phylogeny. But even if the genetic resolution of MLST is relatively low, it gives you a good first impression at a glance. Furthermore, MLST sequence types are still frequently reported in the scientific literature for standardization.
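The low resolution of MLST becomes clear when two allelic profiles are compared: a locus either carries the same allele number or it does not, no matter how many nucleotides differ. The following minimal sketch illustrates this. The function name, the locus names and the allele numbers are hypothetical and do not come from any real PubMLST scheme, and the sketch is not the algorithm used by eBURST.

```python
def allelic_differences(profile_a, profile_b):
    """Count the loci at which two allelic profiles carry different allele numbers.

    A locus counts as one difference whether the two alleles differ
    by a single nucleotide or by many.
    """
    if profile_a.keys() != profile_b.keys():
        raise ValueError("profiles must cover the same loci")
    return sum(1 for locus in profile_a if profile_a[locus] != profile_b[locus])


# Hypothetical seven-locus profiles (locus name -> allele number).
isolate_1 = {"locus1": 5, "locus2": 4, "locus3": 1, "locus4": 12,
             "locus5": 3, "locus6": 7, "locus7": 2}
isolate_2 = {"locus1": 5, "locus2": 4, "locus3": 1, "locus4": 12,
             "locus5": 3, "locus6": 9, "locus7": 2}

print(allelic_differences(isolate_1, isolate_2))  # 1
```

Two isolates that differ at one locus have different sequence types, even though six of their seven housekeeping genes are identical.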
Checklist for Assembly Statistics
Input Files: FASTA Assemblies (endings *.fasta, *.fas, *.fa or *.ffn) in a folder named either “SPAdes_Final_Assemblies”, “A5_Final_Assemblies”, “Skesa_Final_Assemblies” or “Postprocessed_Assemblies” in your current working directory.
Output Files: A file called “Assembly_Stats.txt” in the folders mentioned above. Statistics are also graphically displayed as png files (seq_count.png, n50.png, gc_content.png).
Log-File: “Stats.log”. You will find this file in your current working directory. Always check this file for potential error messages.
Storage: 1 – 2 MB for the files mentioned above for each folder.
Recommended instance type: All instances will work, even those with low computational power.
Timeframe: Approx. 1 – 5 minutes, mainly depending on the number of assembly files.