Tutorial: How to Construct a Core Genome with a Graphical User Interface (GUI)
Tutorial video on how to construct a core genome from whole genome assemblies using Spine
Core Genome Construction using a Spine GUI within the DocMind Analyst Software Suite
Based on your genome assemblies, you might want to investigate the genetic relatedness between your samples. To do that, you first need to identify the genetic regions that are shared by all of your samples, since only these regions should be taken into account for a maximum likelihood phylogeny. Genetic regions that are shared among all (or nearly all) of your samples are called the core genome. These regions are highly similar and contain the same genes, but they are usually not identical. Regions that are not shared among all (or most) samples are called the accessory genome. This has been illustrated below. The first step of the pipeline constructs the core genome.
Genes and intergenic regions within the core genome can vary among samples, and these differences are used to estimate the genetic relatedness between strains, either within the same species or between different species. You can determine such single nucleotide polymorphisms (SNPs) or small insertions and deletions (indels) using the high-quality reads of the samples. For detailed instructions on SNP calling with the DocMind Analyst, please read this tutorial.
Input File Requirements
To construct a core genome, start the DocMind Analyst and navigate to the “Core Phylogeny” panel (Home -> Phylogeny -> Core Phylogeny). In the first step, the software tool Spine will generate a core genome for you. First, you must provide assembled genomes in FASTA or GenBank format (*.fasta, *.fas, *.fa, *.gb, *.gbk) in a current working directory that you can choose using the DocMind Analyst entry mask. In the “Pipeline Options” panel, you must check the “Core Genome Generation” box. You can find a tutorial here that explains how you can assemble genomes from sequencing reads.
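Before submitting a job, it can help to confirm which files in the working directory have one of the accepted extensions. The following is a minimal sketch, not part of the DocMind Analyst; the function name `find_genome_files` is hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions listed in this tutorial for assembled genomes.
ACCEPTED_EXTENSIONS = {".fasta", ".fas", ".fa", ".gb", ".gbk"}


def find_genome_files(working_dir):
    """Return the files in working_dir that have an accepted extension, sorted by path.

    Assumption: extensions are compared case-insensitively.
    """
    return sorted(
        path
        for path in Path(working_dir).iterdir()
        if path.is_file() and path.suffix.lower() in ACCEPTED_EXTENSIONS
    )


if __name__ == "__main__":
    # List the candidate genome files in the current directory.
    for genome_file in find_genome_files("."):
        print(genome_file.name)
```

Files with any other extension (for example notes or spreadsheets) are simply skipped by this check.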
Next, you need to choose the options for the core genome generation. As usual, you only need to worry about the basic options on the left side of the menu; the default settings on the advanced side usually work well. The first basic option in the “Spine Core Genome Generation Basic Options” panel is the minimum size of core region sequences. The regions that Spine identifies as core can be quite small, even below 100 base pairs. If you use a very small value here, you will include all core regions, but you increase the risk of including regions that stem from low-quality assembly or potentially false alignments. If you choose larger values, the regions in your core genome are more reliable. However, if the value is too large, you might lose some regions. The default is 500 base pairs, which is usually a good starting point.
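The effect of the minimum size can be illustrated with a short sketch. This is not Spine's code; `filter_by_min_size` is a hypothetical helper, the segment lengths are made up for illustration, and treating the threshold as inclusive is an assumption.

```python
def filter_by_min_size(segment_lengths, min_size=500):
    """Keep only segments that are at least min_size base pairs long.

    Assumption: a segment exactly min_size long is kept.
    """
    return [length for length in segment_lengths if length >= min_size]


# Made-up core segment lengths in base pairs.
lengths = [80, 450, 500, 12000]
print(filter_by_min_size(lengths))                # [500, 12000]
print(filter_by_min_size(lengths, min_size=100))  # [450, 500, 12000]
```

Lowering the threshold keeps more short segments, which is where assembly and alignment artifacts are most likely to show up.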
The next option is the “percentage of genomes considered core”, which is also important. With the default (100%), a region must be found in all genomes to be considered core. While this seems logical and consistent with the definition of a core genome, keep in mind that assemblies from short-read data are often incomplete. As a result, a region that is present in the sequenced organism can be missing from your assembly. To account for such imperfect assemblies, you can decrease that value (usually to 98% – 99%). The last option is easy: you just need to choose a name (prefix) for your core genome.
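To get a feeling for what a given percentage means for your data set, you can translate it into a number of genomes. The sketch below assumes that the required count is rounded up; how Spine rounds internally is not stated in this tutorial, so treat the result as an estimate. The function name `min_genomes_required` is hypothetical.

```python
import math


def min_genomes_required(n_genomes, percent_core=100.0):
    """Estimate how many genomes must contain a region for it to count as core.

    Assumption: the required count is rounded up to the next whole genome.
    """
    # round() guards against tiny floating-point overshoots before ceil().
    return math.ceil(round(n_genomes * percent_core / 100.0, 9))


print(min_genomes_required(30, 98))   # 30
print(min_genomes_required(200, 98))  # 196
```

Under this assumption, lowering the value to 98% changes nothing for 30 genomes, while for 200 genomes a region may be missing from up to four assemblies.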
Once you have more experience, you can check out the advanced options in the “Spine Core Genome Generation Advanced Options” panel. Here you find the parameter “Minimum percent identity of core regions”. The default is 85%, which means that regions with at least 85% similarity are considered homologous. This is the key parameter for how Spine decides whether regions are shared among genomes. If you choose a higher value, you will include fewer regions in the core genome, but they will be more similar to each other. While this is more specific, you might miss regions that share the same genes but show high structural variation. On the other hand, decreasing the value can lead to the inclusion of regions that might not share the same genes. The default of 85% was evaluated for Pseudomonas aeruginosa but should be a good first choice for other microorganisms too.
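As a rough illustration of what a percent identity threshold means, the sketch below compares two sequences that are already aligned to the same length. It is a simplified stand-in, not the way Spine computes identity; the function name `percent_identity` and the example sequences are hypothetical, and counting gap columns as mismatches is an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Return the percentage of alignment columns with identical bases.

    Both inputs must already be aligned (same length, '-' marks a gap).
    Assumption: columns that contain a gap count as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


identity = percent_identity("ACGTACGTAC", "ACGTACGTTT")
print(identity)          # 80.0
print(identity >= 85.0)  # False: below the default threshold
```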
The last parameter is the “Maximum distance between segments”. When two core regions are identified, they are normally reported as two different regions, meaning that you will find two contigs in the core genome FASTA file. However, if the gap between two core regions is less than the value specified in this parameter (default is 10 base pairs), Spine will output one fragment (contig) instead of two and fill the gap with N’s.
After accepting the Terms of Service, you can start the pipeline by pressing the “Submit Job” button. The pipeline is designed so that all files you have provided in your current working directory are processed, so you don’t need to do anything but wait until it is finished.
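The merging behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not Spine's implementation: segments are given as hypothetical (start, sequence) pairs with 0-based start positions on the reference, and the example data is made up.

```python
def merge_close_segments(segments, max_gap=10):
    """Merge neighbouring segments whose gap is smaller than max_gap.

    segments: iterable of (start, sequence) pairs, 0-based start positions.
    The gap between merged segments is filled with 'N' characters.
    """
    merged = []
    for start, seq in sorted(segments):
        if merged:
            prev_start, prev_seq = merged[-1]
            gap = start - (prev_start + len(prev_seq))
            if 0 <= gap < max_gap:
                # Close enough: join into one contig, padding the gap with Ns.
                merged[-1] = (prev_start, prev_seq + "N" * gap + seq)
                continue
        merged.append((start, seq))
    return merged


segments = [(0, "ACGT"), (7, "GGCC"), (40, "TTAA")]
print(merge_close_segments(segments))
# [(0, 'ACGTNNNGGCC'), (40, 'TTAA')]
```

The first two segments are 3 base pairs apart and become one contig, while the third is too far away and stays separate.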
Module Output Files
A new folder named “Core_Genome_Results” will be generated automatically. In there, you will find the output files, including statistics, log files and the core genome itself. The core genome file is called “prefix.backbone.fasta” (with the prefix you have chosen). In your current working directory, you will find a file called “genome_files.txt”. The genome at the top of this list is the reference genome on which the core genome is based. This means that the identified core regions have the same sequence as in this reference genome. Each core region is included in the core genome as a separate contig. If you prefer a concatenated version of your core genome (all core regions fused instead of being single contigs), look in your current working directory, where you will find a file named “prefix.backbone.concatenated.fasta” (with the prefix you have chosen). At this point, you have constructed a core genome. Congratulations!
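If you want a quick summary of the resulting core genome, a FASTA file can be parsed with a few lines of code. The sketch below is independent of the DocMind Analyst; the function names `read_fasta` and `summarize_core_genome` are hypothetical, and the path in the usage note is only a placeholder that you need to replace with your own prefix.

```python
def read_fasta(path):
    """Read a FASTA file and return a list of (header, sequence) pairs."""
    records = []
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there is one.
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:].strip(), []
            elif name is not None:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def summarize_core_genome(path):
    """Return the contig count and basic length statistics of a FASTA file."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "shortest_bp": min(lengths, default=0),
        "longest_bp": max(lengths, default=0),
    }
```

For example, `summarize_core_genome("Core_Genome_Results/prefix.backbone.fasta")` would report how many core regions were written and how many base pairs they cover in total.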
Input Files: Assembled genomes in FASTA or GenBank format (*.fasta, *.fas, *.fa, *.gb, *.gbk) in your current working directory.
Output Files: The core genome file is called “prefix.backbone.fasta” (with the prefix you have chosen) and is located in a folder called “Core_Genome_Results”. The concatenated version of the core genome can be found in the current working directory and is called “prefix.backbone.concatenated.fasta” (with the prefix you have chosen).
Log-File: “spine.log”. You will find this file in your current working directory. Always check this file for potential error messages.
Storage: Spine needs free disk space for temporary files. Allow for at least 5 GB of free disk space. This is very important: if you do not provide enough storage, your analysis will be canceled at some point. You can increase the volume size of your instance by selecting the appropriate volume in your AWS console (under Volumes). Note that you cannot decrease it afterwards.
Recommended instance type: At least m5.4xlarge/m4.4xlarge. When your instance is stopped, you can change your instance type by clicking on Actions -> Instance Settings -> Change Instance Type. You might need to request the use of faster instances from AWS support.
Timeframe: The run time is hard to predict since it does not scale linearly with the number of genomes. Approx. 10 – 15 minutes for 30 genomes (depending on the AWS instance and the sequencing coverage), but it can take days for several hundred genomes. Consider faster instances in this case (like m5.12xlarge or m5.24xlarge).
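Two of the notes above, the free disk space and the check of “spine.log”, can also be scripted. The following is a minimal sketch with hypothetical function names. The format of the log file is not described in this tutorial, so searching for the words "error" and "fail" is an assumption and does not replace reading the log yourself.

```python
import shutil


def has_enough_space(path=".", required_gb=5):
    """Return True if the file system holding path has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


def find_error_lines(log_path, keywords=("error", "fail")):
    """Return (line number, text) pairs for log lines that mention a keyword.

    Assumption: problems are reported on lines containing one of the
    keywords, compared case-insensitively.
    """
    hits = []
    with open(log_path, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if any(keyword in line.lower() for keyword in keywords):
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling `has_enough_space(".")` in the working directory before submitting a job, and `find_error_lines("spine.log")` after it has finished, covers both checks.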