Whole-genome sequencing data analysis¶
Understanding genetic variations, such as single nucleotide polymorphisms (SNPs), small insertion-deletions (InDels), multi-nucleotide polymorphism (MNPs), and copy number variants (CNVs) helps to reveal the relationships between genotype and phenotype. Currently, high-throughput whole-genome sequencing (WGS) and whole-exome sequencing (WES) are widely used approaches to investigate the impact of DNA sequence variations on human diversity, identify genetic variants associated with human complex or Mendelian diseases and reveal the variations across diverse human populations. There are significant advantages and limitations of both of these techniques, but balancing cost- and time-effectiveness against the desired results helps choosing the optimal sequencing approach. WES may cost less than WGS because it covers only protein-coding regions and generates less raw data, but WGS provides more comprehensive picture of the genome considering both non-coding and coding genomic regions. It also allows to identify SV and CNV that may be missed by WES. Moreover, WGS allows to obtain more uniform and reliable coverage. All in all, WGS is a more universal method than WES.
This tutorial will guide you through the genetic variants discovery workflow on Genestack. We will analyse a dataset by Dogan et al. including high coverage (x35) WGS data from a Turkish individual. The experiment can be found in Public Experiments — regularly updated collection of freely accessible experiments imported from SRA, ENA, GEO, ArrayExpress. Genestack enables you to work on public and private data seamlessly. Import your own sequencing data, mapped reads or genetic variants data with our data importer .
The genetic variants analysis pipeline includes the following steps:
- Quality control of raw reads
- Preprocessing of the raw reads
- Unspliced mapping of the preprocessed reads onto a reference genome
- Post-alignment processing
- Quality control of the mapped reads
- Variant calling
- Variant annotation
- Variant filtering and prioritisation
Raw sequencing assays from the Dogan et al. experiment, all processed data and reports are located in the WGS Data Analysis on Genestack Platform.
Quality control of raw sequencing reads¶
Poorly identified bases, low-quality sequences and contaminants (such as adaptors) in the raw sequencing data can affect downstream analysis leading to erroneous results and conclusions. Before starting the WGS analysis, we will check the initial data quality and decide how to improve the downstream analysis by a variety of preprocessing options. FastQC Report app is based on FastQC tool and produces several statistics characterising the raw data quality: Phred score distribution per base and per sequence, GC content distribution, per base sequence content, read length distribution, and sequence duplication level. We will compute quality control statistics with FastQC Report app for both assays from the dataset. In order to do so, open the dataset in the Metainfo Editor, click on the Analyse button and select from the list of suggested options the “FastQC Report” app.
Besides, feel free to run the “Raw Reads Quality Control” public data flow which you can find on the Dashboard. The app page presents the quality control part of the pipeline in a graphical form. To generate QC-reports click on the Run Data Flow button and, then, on Start initialization now.
If you don’t want to generate QC-reports now, click Delay initialization till later button.
In this case you can start initialization, for example, from Multiple QC Report apps allowing to explore obtained results for both samples at the same time.
The calculations can be started directly from the Multiple QC Report app page by clicking “(re)-start computation if possible”.
Follow the progress of your tasks in Task Manager.
When the computations are finished, QC reports for both sequencing runs will appear on the Multiple QC app page. Explore reports for each individual assay in FastQC Report app by clicking on the app or file name in the Task Manager. Alternatively, go to Created files folder and look for a folder containing the files created for “Raw Reads Quality Control” data flow. To describe raw reads quality control statistics we will use the reports from our tutorial folder previously prepared by our team. To start with, we will open both of them in Multiple QC Report app that interactively represents QC statistics for several raw assays at once.
You can select samples of interest, for example ones that are suitable for further analysis, and put them in the separate folder by click on the New folder with selection button.
For paired reads the quality control report contains statistics such as total nucleotide count, GC content, number of reads, and number of distinct reads. Using the Multiple QC Report app you can sort assays using QC-keys mentioned above and metainfo-keys, such as “method” or “organism”. Now when we have the general impression of quality of raw reads we can go deeper and get a more detailed statistics using FastQC report for each individual sequencing run.
FastQC report contains several quality control metrics outlined below:
- Basic statistics of raw data, for example the total number of reads processed, and GC content;
- Sequence length distribution describing * * the distribution of fragment sizes in the analysed sequencing assay;
- Per sequence GC content plot displaying the GC content across the whole length of each individual read;
- Per base sequence quality plots depicting the range of quality scores for each base at each position in the analysed sequencing assay;
- Per sequence quality scores plot allowing the detection of poor quality sequences in the total sequences;
- Per base sequence content plots representing the relative number of A, C, T, and G for each position in the tested sample;
- Sequence duplication level plots representing the proportion of non-unique sequences which could be present in the library;
- Overrepresented sequences providing the information on sequences that make up more than 0.1 % of the total, and may either have a high biological significance or indicate contamination of the library.
Table located on the left side of the page informs us which reports raise concerns or report failures. In this case it is the Per base sequence content, Sequence duplication levels * and *Overrepresented sequences metrics. Raw data for both sequencing runs failed the per base sequence content metric. Ideally, in a random library we would see four parallel lines representing the relative base composition. Fluctuations at the beginning of reads in the tested sample may be caused by adapter sequences or other contaminations of the library.
The warning reported for the sequence duplication * metric for the first sequencing run indicates that the number of non-unique sequences in the assay has reached more than 20 % of the total. The average duplication levels for read mates are 1.50x and 1.48x. *Sequence duplication plot represents the relative number of sequences having different duplication levels, and for WGS experiments, generally characterised by even coverage, this graph should quickly drop to zero. Duplicates could correspond to PCR amplification bias generated during library preparation or reading the same sequence several times.
Lastly, according to the reports, the first sequencing run compared to the second one contains some over-represented sequences — sequences that are highly duplicated in a sample. In total, the app identified 1,052,139 sequences consisting of ‘N’-bases.
The mentioned issues could be fixed by performing appropriate preprocessing of the raw data. In this case, we will trim low quality bases at the read ends and remove adaptors and contaminants. Moreover, we will filter reads by quality score, so that in further analysis we will only consider reads with high quality (average Q≥20) score. Despite differences in the raw data quality, we will apply the same preprocessing steps to both samples. It should be stressed that after any applied preprocessing step you can check its influence on the quality of raw reads using the FastQC app.
Now that we have checked the quality of sequencing assays and decided on the appropriate preprocessing steps, it is time to create the pipeline for genetic variants analysis of WGS data from the raw data preprocessing to the genetic variants annotation and filtering.
Building the genetic variants analysis pipeline¶
To start the pipeline, open the “Homo sapiens Genome sequencing and variation” dataset in the Metainfo Editor, click Analyse, and select the first preprocessing app — Trim Adaptors and Contaminants.
On the Trim Adaptors and Contaminants app page you can explore the list of created and source files, edit parameters and continue building the pipeline.
You can also choose the app to explore results, for example discover how this step affects the initial quality of raw reads with FastQC Report app. If you want to learn more about the application, click on its name and go to About application.
You can use the created trimmed files as inputs for other applications. Let’s click on Add step button and select the next preprocessing app — Trim Low Quality Bases. This will take you to the Trim Low Quality Bases app page. Proceed in the same way and add all the desired steps to the pipeline until you reach the final one — Effect Prediction.
Don’t forget to set the parameters for each app in the pipeline and select appropriate reference genome, in this case H. sapiens reference genome (GRCh37.75), that will be used by Unspliced Mapping with BWA, Variant Calling and Effect Prediction apps. You can return to any added app by clicking on the name of app we are interested in. The very final output file containing genetic variants and their possible effects on annotated genes can be opened with Variant Explorer and Genome Browser apps.
To be able to re-use manually built pipeline you could create a data flow. Click on the resulting file name on the final Effect Prediction app page and go to Manage and Create new Data Flow.
The created data flow will be opened in the Data Flow Editor, where the pipeline for genetic variants investigation using WGS is graphically represented. Click on the Run data flow button to create all files for the pipeline.
This will take you to the Data Flow Runner page. To run the pipeline click on the Run Data Flow button at the bottom of the data flow.
After that you will be suggested to either start the computation now or delay it till later:
We will postpone the analysis and focus on each step of the WGS data analysis pipeline. Later we can start initialization directly from one of the suggested apps, such as Variant Explorer, Genome Browser or Effect Prediction.
You can verify processing parameters on each individual app pages before running the pipeline. To do this, click on Edit file list and open the file using the app that created this file:
Data Flow Runner allows you to start initialization up to any step of the pipeline. We recommend you check the mapping quality after removing the duplicates from mapped reads to assure that they could be further used in variant calling and effect prediction. In order to do this, click on 2 files in Remove Duplicated Mapped Reads section and start initialization with right-click context menu. Follow the process in the Task Manager. Regardless of the status of the analysis all the created data flow files will be located in the corresponding folder in the Created files folder.
Note that there is a data flow file including all the mentioned preprocess and analysis steps previously prepared by Genestack team. This data flow is called WGS data analysis for Dogan et al. (2014) and you can find in our tutorial folder. Now let’s talk about each of the analysis steps we included in the pipeline in greater detail.
Preprocessing of the raw reads¶
Often overlooked, preprocessing of raw data is essential due to the fact that it improves the original data quality, and consequently, the results of the downstream analysis. To prepare raw reads for variant calling and annotation we will run several preprocessing apps: Trim Adaptors and Contaminants, Trim Low Quality Bases and Filter by Quality Score. Firstly let’s explore the parameters of the Trim Adaptors and Contaminants app.
Trim Adaptors and Contaminants app finds and clips adapters and contaminating sequences from raw reads. The authors of the experiment trimmed and filtered reads with Trimmomatic so that reads with a high quality score and a minimum length of 36 bp after trimming were kept. We will apply default parameters ejecting reads below a length of 15 bp. You can change the minimum length of trimmed sequence on the app page
The next read preprocessing procedure we plan to do is removing bases of low quality with Trim Low Quality Bases app based on seqtk 1.0 tool. It removes nucleotides of a low quality from the raw data according to phred33 score that encodes the probability that the base is called incorrectly. Currently, this app does not support any changeable command line options.
We will finalize the data preprocessing by filtering of trimmed reads by quality score. The app filters out reads from input file according to the set value of Phred33 quality score. As usual, you can change the default parameters on the app page. We will eliminate all reads with quality score below 20, considering only the bases called with 99 % accuracy. By default “Minimum quality score” is already equal to “20”, so we only need to set the command line option “Percentage of bases to be above the minimum quality score” to “100”.
Check the quality of the preprocessed reads with FastQC Report app to assure that it is satisfactory or make decisions about additional preprocessing steps prior further analysis. After we have completed all the preprocessing steps, our raw reads are of better quality, and now it is time to begin the mapping of analysis-ready reads onto the human reference genome.
Unspliced mapping reads onto a reference genome¶
To map preprocessed reads to the reference genome we will use the Unspliced Mapping with BWA app which with high efficiency and accuracy alines sequencing reads against a whole reference genome without allowing large gaps, such as splice junctions. The BWA-based aligner currently has a hardcoded command line.
Prior to the variant discovery we would recommend you to check the quality of mapped reads because some issues, such as low coverage, homopolymer biases or experimental artifacts, only appear after the alignment. One of such quality metric is sequencing coverage depth that could determine the confidence of variant calling. The deeper the coverage, the more reads are mapped on each base and the higher the reliability and the accuracy of base calling. The assays from the Turkish individual were obtained with high coverage (35x) WGS. Let’s explore read coverage for generated mapped reads for both runs interactively in Genome Browser, which allows navigation between regions of the genome. To learn more about navigating in Genome Browser look at our blog post.
Remove duplicated mapped reads¶
Sometimes due to errors in the sample or library preparation, reads may come from the exact same input DNA template and accumulate at the same start position on the reference genome. Any sequencing error will be multiplied and could lead to artefacts in the downstream variant calling. Although read duplicates could represent true DNA materials, it is impossible to distinguish them from PCR artifacts, which are results of uneven amplification of DNA fragments. To reduce this harmful effect of duplicates prior to variant discovery we will run Remove Duplicated Mapped Reads app based on Picard MarkDuplicates tool. To determine duplicates Picard MarkDuplicates uses the start coordinates and orientations of both reads of a read pair. Based on the identical 5’mapping coordinates it discards all duplicates with the exception of the “best” copy.
Quality control of mapped reads¶
As you remember, we run just a part of the pipeline including preprocessing, alignment and removing of duplicates to check if the mapping quality is good enough and we can move on to variant calling and annotation.
Post-mapping quality control is not necessary, but is a very important step. The mapped Reads QC Report app produces various QC-metrics such as base qualities, insert sizes, mapping qualities, coverage, GC bias and more. It helps to identify and fix various mapping issues and make downstream processing easier and more accurate. Find the created filtered mapped reads (the outputs of Remove Duplicated Mapped Reads app) in the Created files folder.
As in the case of raw reads quality control, you may explore results not only in Mapped Reads QC Report app itself, but also compare the mapping quality of both tested assays with Multiple QC Report app. Report appears on the page as the computation is finished.
Let’s look at the example report for the two sequencing runs from our experiment. Go to the tutorial folder and open QC reports for both mapped reads files in Multiple QC Report app. Use the drop-down menu “Select QC keys to display” and “Select metainfo to display” to specify which QC-metrics and sample associated information you wish to see on the plot.
According to the QC check, both technical replicates from our experiment are concordant with all reads being mapped and 95 % of the reads are mapped properly. To obtain more detailed statistics explore individual QC report in Mapped Reads QC Report app. Let’s explore the mapping quality for the first sequencing run of Turkish individual sample. On the app page you will find mapping statistics such as, for example, numbers of mapped, partially mapped, unmapped mate pairs. Besides general mapping statistics individual QC report contains coverage by chromosome plot, and, for paired-end reads, some statistics on insert size and and insert size distribution plot. As we can see, the median insert size is 364 with standard deviation equal to 66.99.
Insert size distribution plot displays the range lengths and frequencies of inserts (x- and y-axis, respectively) in the analysed assay.
After ensuring that our mapped reads are of high enough quality, we can move on to the final stages of our analysis pipeline — variant identification and effect prediction. Now then, let’s finalize the computations of the pipeline. Make sure to check the parameters of Variant Calling and Effect Prediction apps and start initialization of the rest of the files.
Experimental biases could lead to errors in variant calling mimicking true genetic variants. Variant calling on multiple samples helps increase the accuracy of the analysis by taking the reads from several samples into consideration and reducing the probability of calling sequencing errors. We run Variant Calling app on analysis-ready mapped reads for both technical replicates with default parameters that always could be changed on the Variant Calling app page. In the picture below you can see source files (reference genome and both filtered mapped reads files) and default command line options.
Track the progress of your tasks in Task Manager and as soon as the computation is finished, explore the results of variant identification using the interactive applications such as Variant Explorer or Genome Browser. Now, let’s take a look at the results of variant calling in the Genome Browser. Let’s click on the genetic variants file name in Task Manager and open it in Genome Browser using the context menu. In case some files are not initialized yet, Genome Browser page is empty. You can initialize the files by clicking on Go! to start the process. Tracks representing found mutations will appear on the page as the task is finished. The reference track displaying annotated genes with their coordinates and variation track representing genetic variants, their genomic position, average mapping quality and raw read depth.
Zoom in to explore genetic variants in single-base resolution. For example, looking at the region 1:725878-725972 (95 bp) we can see several SNPs (red) and one deletion 5bp long (blue).
After variants have been identified, we can annotate them and identify the effects they produce on known genes with Effect Prediction app. The app is based on SnpEff tool that annotates variants based on their genomic location: intronic, untranslated regions (5′UTR or 3′UTR), upstream, downstream, splice site, or intergenic regions. It also determines effects genetic variants have on genes, such as amino acid replacement or frame shifts. Remember, if you have some files uninitialized, you can run the analysis on the Effect Prediction page or Data Flow Runner page.
Let’s now analyse the annotated variants in the genome of the Turkish individual. You can do this with Report Viewer application: right click the “Variants with predicted effects for Dogan et al. (2014)” file name and go to View Report. Note that you can also explore annotated variants in Genome Browser and Variant Explorer apps.
First of all, the report summary contains some basic information about the analysed file.
In general 4,389,254 mutations were found in our assay with one change every 7,014 bases. The most common variants are SNPs that make up 3,835,537 from the total. The second most abundant genetic variation type after SNPs are Indels. Insertions and deletions were found in 252,548 and 301,169 change cases, respectively. According to the paper, the authors identified 3,642,449 and 4,301,769 SNPs using Casava and GATK workflows, respectively. However in the downstream analysis they used 3,537,794 variants identified by both methods.
Insertion deletion length histogram graphically demonstrates the distribution of length of all insertions and deletions. The discovered Indels ranged from -43 to 28 bp in length with the standard deviation of 5.256. Authors detected 713,640 InDels (341,382 insertions and 372,258 deletions) ranging from −52 bp to 34 bp in length.
Additionally, we performed filtering by effect to find out InDel distribution throughout different genomic locations. From identified InDels 258680 and 263835 were in an intergenic and intronic region, respectively. We also found 69426 InDels in the upstream and 74162 InDels in the downstream gene regions. Only 69 and 78 mutations were detected in the splice site donor and in splice site acceptor, respectively. Finally, we detected 6241 insertions and deletions in exons. Besides the statistics on the change type of the found mutations, report also contains quality and coverage information. Quality histogram shows quality distribution with minimum value of 3 and maximum value of 486 for the analysed data:
The following histogram shows coverage. For our data the mean coverage is 28.882 while the maximum coverage is 8,026.
For all the identified genetic variants the app also calculates associated effects and prioritises them by putative biological impact.
For example, if a found mutation leads to a protein truncation, then it could have a high and disruptive effect on the gene function. However, variants that influence only the protein effectiveness will most likely have only a moderate effect, and synonymous variants that will unlikely change the protein behaviour will probably have low effect. Variants affecting non-coding genes are considered as modifiers. It is important to remember that grouping doesn’t guarantee that it is the high-impact effect variants that are responsible for the analysed phenotype. Genetic variants could have various effects on the genome for instance they could result in codon changes, insertions or deletions, frame shift mutations etc. Genetic variants can affect different genomic regions such as exons, intergenic regions, introns, untranslated regions, splice sites, upstream and downstream regions. As we can see from the report most changes in the Turkish individual genome are located in intronic regions (63,351 % of the total).
As we can see the vast majority of identified variations are associated with introns (climbed above 60 %) and there is no mutations in splice sites. The changes in intergenic regions represent ~17 % of the total, while changes in exons occur in approximately 2 % of events.
The most frequent base changes is G to A with 651,754, followed by C to T (650,016), T to C (621,506) and A to G (620,959) base changes.
The quality of SNP data could be characterised with transition/transvertion (Ts/Tv) ratio that for whole human genome is typically about 2. Note that this ratio is not universal and could vary with regions, for example it is higher for exons.
Our results are in line with the original paper by Dogan et. al where they have identified 2,383,204 transitions, 1,154,590 transversions resulting in Ts/Tv ratio of 2.06. Next entry of the report is the codon replacements table (we have posted a fragment of it below). Rows represent reference codons and columns represent changed codons. The most common codon change for our data is from GAC to GAT (876 events) resulting in a synonymous change.
The report also contains the amino acid changes table where reference amino acids are represented by rows and changed amino acids are represented by columns. For example, row ‘A’ and column ‘E’ show how many Ala have been replaced by Glu. The most common amino acid changes are Ala to Thr, 722 times, followed by 693 changes from Ile to Val events, and 780 Val to Ile events.
Apart from the mentioned statistics and plots, report also contains allele frequency plots and information on the change rate per chromosome.
Genetic variants filtering¶
Resulting genetic variants files, annotated or not, can be opened in the Variant Explorer app. In the Variant Explorer you can interactively explore the information about found mutations, as well as sort and filter them by specific factors such as: locus, type of variants (SNP, INS, DEL, MNP), reference or alternative allele, Phred-scaled probability that the alternative allele is called incorrectly, and for annotated variants by their predicted effect, impact and functional class. Besides that, the app computes genotype frequencies for homozygous samples with reference and alternative alleles (GF HOM REF and GF HOM ALT columns, respectively), reads depth for homozygous samples with alternative allele (DP HOM ALT) and reads depth for heterozygous samples (DP HET). To prioritise found mutations open an annotated genetic variants file in the Variant Explorer: right-click on the resulting file name in the Data Flow Runner, Task Manager or File Browser and select Variant Explorer in the context menu. In total 4,361,389 variants were found.
Let’s now use the filters to see how many of these are high impact variants. Set the filter “Impact” to “high”. As we can see out of all the identified variants 1007 have a high impact.
Let’s now see how many of these are nonsense mutations by applying “Functional class” filter. And now out of all the high impact variants, 154 are nonsense mutations.
Let’s see how many of those are found on chromosome 10 by specifying the chromosome in the “Locus”. Turns out on chromosome 10 there only one variant change that is high impact nonsense mutation. This base change is located in CTBP2 gene, and result in a premature stop codon.
These are all of the steps of WGS data analysis pipeline. You can use files from our tutorial folder to reproduce the results. Feel free to perform further prioritisation, play with filters in Variant Explorer to get more information. For example, you may want to find out, how many InDels results in frame-shift, codon deletion or explore variant distribution on any region of interest etc. In summary, our analysis allowed to identify 3,835,537 SNPs. We also identified 252,548 insertions and 301,169 deletions ranging from -43 to 28 bp. Although our results are in concordance with original paper, there are also some differences in number of identified mutations or InDel length distribution we mentioned above. Such variation could be explained by the use of different tools. For example, authors identified variants with the vendor-supplied Eland-Casava pipeline and The Genome Analysis Toolkit (GATK v2.2), while we used Variant Calling application based on SAMtools and BCFtools.
This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most out of our platform. If you have any questions and comments, feel free to email us at email@example.com. Also we invite you to follow us on Twitter @genestack.