Whole-exome sequencing data analysis¶
As one of the widely used targeted sequencing method, whole-exome sequencing (WES) has become more and more popular in clinical and basic research. Albeit, the exome (protein-coding regions of the genome) makes up ~1 % of the genome, it contains about 85 % of known disease-related variants (van Dijk E.L. et al, 2014), making whole-exome sequencing a fast and cost-effective alternative to whole genome sequencing (WGS).
In this tutorial we’ll provide a comprehensive description of the various steps required for WES analysis, explain how to build your own data flow, and finally, discuss the results obtained in such analysis.
Setting up an exome sequencing experiment¶
First off, let’s choose exome sequencing data. You can upload your own data using Import button or search through all public experiments we have on the platform. Our analysis will be based on data coming from Clark et al. 2011. Let’s find this experiment in the platform and open it in Metainfo Editor:
The authors compared the performance of three major commercial exome sequencing platforms: Agilent’s SureSelect Human All Exon 50Mb, Roche/Nimblegen’s SeqCap EZ Exome Library v2.0 and Illumina’s TruSeq Exome Enrichment; all applied to the same human blood sample. They found that, the Nimblegen platform provides increased enrichment efficiency for detecting variants but covers fewer genomic regions than the other platforms. However, Agilent and Illumina are able to detect a greater total number of variants than Nimblegen platform. Exome sequencing and whole genome sequencing were also compared, demonstrating that WES allows for the detection of additional variants missed by WGS. And vice versa, there is a number of WGS-specific variants not identified by exome sequencing. This study serves to assist the community in selecting the optimal exome-seq platform for their experiments, as well as proving that whole genome experiments benefit from being supplemented with WES experiments.
Whole-exome sequencing data analysis pipeline¶
A typical data flow of WES analysis consists of the following steps:
- Quality control of raw reads
- Preprocessing of raw reads
- Mapping reads onto a reference genome
- Targeted sequencing quality control
- Quality control of mapped reads
- Post-alignment processing
- Variant calling
- Effect annotation of the found variants
- Variant prioritisation in Variant Explorer
Let’s look at each step separately to get a better idea of what it really means.
Quality control of raw reads¶
Low-quality reads, PCR primers, adaptors, duplicates and other contaminants, that can be found in raw sequencing data, may compromise downstream analysis. Therefore, quality control (QC) is essential step in your analysis to understand some relevant properties of raw data, such as quality scores, GC content and base distribution, etc. In order to assess the quality of the data we’ll run the Raw Reads QC data flow:
Genestack FastQC application generates basic statistics and many useful data diagnosis plots. Here is some of them for sample enriched by Aligned SureSelect 50M:
Basic statistics tells you about basis data metrics such as reads type, number of reads, GC content and total sequence length.
Sequence length distribution module reports if all sequences have the same length or not.
Per sequence GC content graph shows GC distribution over all sequences. A roughly normal distribution indicates a normal random library. However, as in our case, if the data is contaminated or there are some systematic bias, you’ll see an unusually shaped or shifted GC distribution:
Per base sequence quality plots show the quality scores across all bases at each position in the reads. By default you see low quality zone and mean quality line. If the median is less than 25 or the lower quartile is less than 10, you’ll get warnings.
Per sequence quality scores report allows you to see frequencies of quality values in a sample. The reads are of good quality if the peak on the plot is shifted to the right, to the max quality score. In our case, almost all of the reads are of good quality (>30):
Per base sequence content plots show nucleotide frequencies for each base position in the reads. In a random library, there could be only a little difference between A, T, C, G nucleotides, and the lines representing them should be parallel with each other. The black N line indicates the content of unknown N bases which shouldn’t be presented in the library. In our case, you can notice the absence of unknown nucleotides and a slight difference in A-T and G-C frequencies:
Sequence duplication levels plots represent the percentage of the library made up of sequences with different duplication levels. In simple words, 44 % of reads are unique, 26 % of reads are repeated twice, 13 % - three times, 4 % - more than 10 times, etc. All these duplicates are grouped to give the overall duplication level. You can use Filter Duplicated Reads application to remove duplicates in raw reads data, however we’ll get rid of them after mapping step.
Application also detects overrepresented sequences that may be an indication of primer or adaptor contamination. We have run QC on all the data in the experiment and put the reports in Raw reads QC reports for Clark et al (2011) folder, so that you can open all of them in Multiple QC Report application to analyse results:
You see that total number of exome sequencing reads is 124,112,466 for Agilent SureSelect, 184,983,780 for Nimblegen SeqCap and 112,885,944 for Illumina TruSeq platform. The whole genome library yielded more than one billion total raw reads.
Preprocessing of raw reads¶
With the comprehensive raw reads QC reports generated by FastQC app, you’re able to determine whether any preprocessing steps such as trimming, filtering, or adaptor clipping are necessary prior to alignment. Here is the list of all preprocess apps that Genestack suggests you to improve the quality of your raw reads:
Our preprocessing procedure will include ‘Trim Adaptors and Contaminants’ step. Once the quality of raw data has been checked, let’s start planning and building our Whole Exome Sequencing Analysis data flow:
To build any data flow in Genestack, choose one of the samples and start to preprocess or analyse it. Each app suggests you to add next analytical step or use relevant viewers:
Note that you can create as many files as you want and run the computation process later. Now let’s create a data flow from the pipeline we built. For the last created file choose Create New Data Flow in Manage section:
This takes us to the Data Flow Editor app page where you can rename, describe your pipeline and change sources. Let’s ‘clear files’ and click ‘Run dataflow’.
Now we’re on the Data Flow Runner application page. Just choose sources — experiment assays and human reference genome — and click Run Data Flow. Note that if you choose several raw reads files, the multi-sample variant calling will be performed. However, in order to compare our results, we need to run this data flow for each sample separately. After that, the app suggests you to choose the explore app where you can start initialization now for whole your analysis or delay it till later:
Let’s delay it. After that the app suggests you to choose the app where you can also start computation:
In order to start computation for each data flow step separately, click on file name and choose Start initialization.
All the data are preprocessed and stored in Trimmed raw reads for Clark et al (2011) folder.
Mapping reads onto a reference genome¶
After raw data QC and preprocessing, the next step is to map exome sequencing data to the reference genome with high efficiency and accuracy. Genestack supports two Unspliced mappers: one is based on Bowtie2, another uses BWA alignment package. We’ll use the last one since it is fast and allows gapped alignments which are essential for accurate SNP and indels (insertion/deletions) identification. The following video illustrates how to start computation on this step and analyse mapping results in Genome Browser:
When mappings are complete, open all four files in Genome browser to compare their read coverage. Let’s look for specific gene or region, for example, HBA1 and HBA2 genes encoding alpha-globin chains of hemoglobin. With WGS technology, you can see coverage in both protein-coding and non-coding sequences:
As for WES technology, you are interested only in exome. That’s why, you see coverage for HBA1 and HBA2 coding regions and do not see it in non-coding ones. To compare read coverage between different enrichment platforms, you can build a coverage track:
In most cases you’ll see a significant coverage for sample enriched by Nimblegen. Moreover, each platform targets particular exomic segments based on combinations of the RefSeq, UCSC, Ensembl and other databases. That’s why, you may expect difference in coverage for specific gene-coding regions. To further use mapped reads, go to the Mapped reads for Clark et al (2011) folder.
Targeted sequencing quality control¶
Besides quality control of the raw sequencing reads, it is also crucial to assess whether the target capture has been successful, i.e. if most of the reads actually fell on the target, if the targeted bases reached sufficient coverage, etc. However, by default the application allows you to compute enrichment statistics for reads mapped only on exome. If you go to the app page, change the value to ‘Both exome and target file’ and select the appropriate target annotation file, you get both exome and/or target enrichment statistics. To do this step, you can ‘generate reports’ for each mapping separately or run our Targeted Sequencing Quality Control public data flow (with default values) for several samples at once and analyse the output reports in Multiple QC Report app:
In this tutorial, we are looking at three exome enrichment platforms from Agilent, Nimblegen and Illumina and assessing their overall targeting efficiency by measuring base coverage over all targeted bases and on-target coverage for each platform:
A typical target-enrichment WES experiment results in ~90 % of target-bases covered at coverage ≥ 1x. This value tends to decrease as the coverage threshold increases. How fast this percentage decreases with the coverage increment depends on the specific experimental design.
Not surprisingly, all the technologies give high coverage of their respective target regions, with the Nimblegen platform giving the highest coverage: about 94 % of the targeted bases were covered at least twice, 93 % at ≥ 10x and 87 % at ≥ 50x. With Agilent, 91 % of bases were covered at ≥ 2x, 86 % at ≥ 10x and 66 % at ≥ 50x. With Illumina TruSeq enrichment, 91 % of bases were covered at ≥ 2x, 86 % at ≥ 10x and only 50 % at ≥ 50x. These results are very similar to the paper results (Clark M.J. et al, 2011):
Regarding the overall percentage of reads mapped on the target, in a typical experiment one may expect ~70 %. Looking at the plot, you see the highest 77 % and 74 % values for samples enriched by Nimblegen and Agilent platforms, respectively. For Illumina TruSeq, on the other hand, only 48 % reads are mapped on the target region. Also, it’s not surprising that we notice the biggest mean coverage on target with ≥ 2x coverage for Nimblegen samples, since this platform contains overlapping oligonucleotide probes that cover the bases it targets multiple times, making it the highest density platform of the three. Agilent baits reside immediately adjacent to one another across the target exon intervals. Illumina relies on paired-end reads to extend outside the bait sequences and fill in the gaps (Clark M.J. et al, 2011):
Target annotations used in this tutorial can be found in Public Data, Genome annotations folder or in Target Annotations for Clark et al (2011) tutorial folder, where we put them for your convenience.
Besides the target enrichment statistics, you can assess the percentage of exome bases with coverage started from ≥ 2x and the overall proportion of reads mapped on exome:
All targeted sequencing QC reports are collected in Mapped reads enrichment reports for Clark et al (2011) folder.
Quality control of mapped reads¶
The reads may look OK on the Raw Reads quality control step but some biases only show up during mapping process: low coverage, experimental artifacts, etc. The detection of such aberrations is an important step because it allows you to drive an appropriate downstream analysis.
You can ‘generate reports’ for each mapping separately or just run Mapped Reads Quality Control data flow for multiple samples and analyse the output reports in Multiple QC Report app:
Output report includes mapping statistics such as:
- Mapped reads: total reads which mapped to the reference genome;
- Unmapped reads: total reads which failed to map to the reference genome;
- Mapped reads with mapped mate: total paired reads where both mates were mapped;
- Mapped reads with partially mapped mate: total paired reads where only one mate was mapped;
- Mapped reads with “properly” mapped mate: total paired reads where both mates were mapped with the expected orientation;
- Mapped reads with “improperly” mapped mate: total paired reads where one of the mates was mapped with unexpected orientation.
The Coverage by chromosome plot shows a read coverage at each base on each chromosome and patch (if it is presented) defined by lines in different colours:
If your reads are paired, the application additionally calculates insert size statistics, such as median and mean insert sizes, median absolute deviation and standard deviation of insert size. The Insert size distribution plot shows the insert size length frequencies:
All complete QC reports for mapped reads are stored in Mapped reads QC reports for Clark et al (2011) folder. You can open all of them at once in Multiple QC Report app to interactively analyse and compare mapping statistics between samples:
Speaking of mapping results, for each sample, almost all of the reads is mapped properly and there is a small percentage of partially or improperly mapped reads.
After mapping reads to the reference genome, it’s recommended to remove duplicates before variant calling, with the purpose of eliminating PCR-introduced bias due to uneven amplification of DNA fragments. That’s why we run Remove Duplicated Mapped Reads app.
Preprocessed mapped reads are stored in Filtered mapped reads for Clark et al (2011) folder.
After duplicate removal, the next step is to identify different genomic variants including SNVs, indels, MNVs, etc. For this, we’ll use Variant Calling application based on samtools mpileup:
The app automatically scans every position along the genome, computes all the possible genotypes from the aligned reads, and calculates the probability that each of these genotypes is truly present in your sample. Then genotype likelihoods are used to call the SNVs and indels.
We run Variant Calling with default parameters, identifying multi-allelic SNPs and indels, excluding non-variant sites and not considering anomalous read pairs. Maximum read depth per position was set as 250 and minimum number of gapped reads for an indel candidate is 1. Base alignment quality (BAQ) recalculation is turned on by default. It helps to rule out false positive SNP calls due to alignment artefacts near small indels. For more information about the app and its options, click on the app name and then on About application.
When files will be complete, you can analyse variants in Genome Browser:
Genome Browser application allows you investigate the variants interactively: how many mutations are in this particular gene or region, review some information about detected variants such as average mapping quality and raw read depth and compare variants enrichment between samples. Analysing variants in Genome Browser, you can notice a large amount of both exome WES–specific and WGS-specific SNVs. We identified variants for each sample separately and put them in Variants for Clark et al (2011) folder.
After variants are detected, use Effect Annotation application based on SnpEff tool. The app annotates variants and predicts the effects they produce on genes such as amino acid changes, impact, functional class, etc. To review this information, open Variants with predicted effects in View report application:
Let’s analyse annotated variants for sample enriched by Nimblegen. Output report contains summary about tool version, number of variants, number of effects, change rate, and other information.
Change rate details table shows length, changes and change rate for each chromosome and patch (if they are presented). Here is the change rate details for the first 10 chromosomes:
The app calculates total number of variants as well as number of homo/hetero single nucleotide polymorphisms (SNPs), multiple nucleotide polymorphisms (MNPs), insertions (INS), deletions (DEL), combination of SNPs and indels at a single position (MIXED) and records it in Number of changes by type table:
SNVs represent the most numerous sequence variations in the human exome. TruSeq detected the highest number of SNVs followed by Agilent and Nimblegen. Most of them are SNPs. For example, in the Nimblegen sample, there are ~555,000 of SNPs and ~40,000 of both insertions and deletions. No significant difference in the ratio of heterozygous to homozygous variants between platforms was observed. However, regarding WGS sample, much more variants were detected (3,8 million of SNPs and about 600,000 indels).
Number of effects by impact table shows count and percentage of variants that have high, low, moderate impact or tagged as modifiers:
As a rule, the mutation has high impact if it causes significant changes such as frame shift, stop codon formation, deletion of a large part (over 1 %) of chromosome or even the whole exon, etc. Variants with low impact do not change function of the protein they encoded. Usually this is synonymous mutations. Moderate variants do not affect protein structure significantly but change effectiveness of the protein function. They mostly include missense mutations, codon deletions or insertions, etc. So called modifiers are mutations in introns, intergenic, intragenic and other non-coding regions.
For sample enriched by Nimblegen, just about 0.04 % of all annotated variants has high impact. However more than 97 % mutation are modifiers. We see the same percentage of modifiers in WES and WGS samples.
Also, the output report contains information about the count and percentage of missense, nonsense and silent mutations. Find out this in Number of effects by functional class table:
For Nimblegen sample, the app detected ~50 % point mutations in which a single nucleotide change results in a codon that codes for a different amino acid (missense mutations). There are more then 50 % of silent mutations which do not significantly alter the protein. And only ~0.3 % are nonsense mutations. They change codon in a way resulting in a stop (or nonsense) codon formation, and subsequently a truncated, incomplete, and usually nonfunctional protein is produced. Almost the same percentage of missense, nonsense and silent mutations we notice for other WES and WGS samples.
Next Number of effects by type and region table outputs how many variants for each type (codon deletion, codon insertion, etc) and for each region (e.g. exon, intron) exist:
Variations histogram additionally illustrates what regions of genome are mostly affected:
Most of variants are detected in the introns. That can be explained by the fact that platform baits sometimes extend farther outside the exon targets.
Quality histogram, like this one below, shows you the distribution of quality scores for detected variants:
This one is asymmetrical, there are more then 160,000 variants with quality of 10 and a lot of small peaks of lower and greater qualities.
Also, the application reports a histogram of Coverage for detected variants:
All variants have coverage 2 and more.
Next Insertions and deletions length histogram shows size distribution of detected indels:
For Nimblegen sample, we identified more than 40,000 indels, of which ~24,000 were deletions of up to 12 bases and the rest were insertions of up to 12 bases. There are more indels were identified after Illumina TruSeq enrichment (~80,000) followed by Agilent (~57,000) and Nimblegen platforms. These findings agree with paper results:
Moreover, most insertions and deletions were 1 base in size. Notably, there is a slight enrichment at indel sizes of 4 and 8 bases in the total captured DNA data, and that is also consistent with paper results (Clark M.J. et al, 2011; Mills R.E. et al, 2006).
In Base change (SNPs) table, the app records how many and what single nucleotide polymorphisms were detected:
There is a slight increase in G→A/C→T transitions and slight decrease in G→C/C→G transversions in both whole exome and whole genome samples.
Transition vs transversions (Ts/Tv) section is about the number of transitions, number of transversions and their ratio in SNPs and all variants.
Transitions are mutations within the same type of nucleotide — pyrimidine-pyrimidine mutations (C↔T) and purine-purine mutations (A↔G). Transversions are mutations from a pyrimidine to a purine or vice versa. The table represents these values taking into account only SNP variants.
But below the table, you can find the information for all variants. For WES data, the Ts/Tv ratio of total variants ranged from 1.6 to 1.8 and was lower than the estimated ~2.6. It can be explained by the fact that the platforms target sequences outside coding exons (only 60 % of variants were found in introns, for Nimblegen sample). However, for WGS data, the ratio is equal to 2 as it’s expected (Ebersberger I. et al, 2002).
Looking at Frequency of alleles histogram, you can evaluate how many times an allele appears once (singleton), twice (doubleton), etc:
In all samples, most of the variants are represented as singletons. Some variants (less than 400,000 for WES, and about 1,5 million for WGS) have two alternate alleles.
Codon changes table outputs what and how many reference codons have been replaced. Here is just a fragment of this table:
Reference codons are shown in rows, changed codons — in columns. The most of changes happened are indicated in red color. For example, 811 ‘ACG’ reference codons have been replaced by ‘ACA’ triplet. If we compare this information between our samples, you’ll find the same type and almost the same number of codon changes across WES samples.
In Amino acid changes table, you can see type and number of amino acid changes. Row indicates a reference amino acid, column - changed amino acid.
For example, 957 Alanines (A, Ala) have been replaced by Tryptophan (T, Trp) in Nimblegen sample. Number and type of amino acid changes look pretty similar across WGS and different WES samples.
Changes by chromosome plots show the number of variants per 10000Kb throughout the whole chromosome length. Such histogram is generated for each chromosome and patch presented in the reference genome. Here is the example plot for chromosome 1:
Besides above mentioned plots and tables, you can see Details by gene as well.
We annotated the variants calculating the effects they produced on known genes and put them in Variants with predicted effects for Clark et al (2011) folder.
Variant prioritisation in Variant explorer¶
The variants can be also interactively analysed in Genestack Variant Explorer application:
Let’s select Illumina sample and open it in Variant Explorer to look at the detected variants:
There are 1,350,608 mutations were identified. Imagine that we are interested only in high-quality nonsense variants: click ‘QUALITY’ header to apply sorting and set ‘NONSENSE’ in ‘FUNCTIONAL CLASS’. You see that the number of mutations is decreased significantly. We have only 104 nonsense variants:
You can use other filters and sorting criteria and look through the ‘Filters history’ to check how many variants were detected after applying specific filter in comparison to the number of mutations we had on the previous filtering step:
When the variants are sorted and filtered, you can share them with your colleagues, export them as tsv file clicking on ‘Download table’ and attach it to your papers and other reports.
So, what can we conclude from our findings? Are the results for WES samples really comparable to a WGS one? If there are any key differences in performance between the three enrichment platforms? And what target capture technology is better to select when planning the exome experiment?
Answering these questions we found that neither of whole exome and whole genome technologies managed to cover all sequencing variants. First, WGS can not and will not replace exome sequencing as due to genome characteristics there will always be regions that are not covered sufficiently for variant calling. Regarding WES, it shows high coverage but only towards the target regions. Second, WGS has its value in identifying variants in regions that are not covered by exome enrichment technologies. These can be regions where enrichment fails, non-coding regions as well as regions that are not present on the current exome designs. That’s why, for covering really all variants, it might be worth to think about doing both WGS and WES experiments in parallel. Both technologies complement each other.
In general, all technologies performed well. Our results demonstrated that they give a very high level of targeting efficiency, with the Nimblegen technology demonstrating the highest one, and able to adequately cover the largest proportion of its target bases. Therefore, the Nimblegen is superior to the Agilent and Illumina TruSeq platforms for research restricted to the regions that it covers. The technologies target different exomic features but all of them cover a large portion of the overall exome with Illumina able to achieve the best exome coverage (~60 %). Moreover, the results showed that Agilent and Illumina platforms appeared to detect a higher total number of variants in comparison to Nimblegen one. That’s why the question of which enrichment platform is best must be answered with respect to all these specific parameters.
This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most out of our platform. If you have any questions and comments, feel free to email us at firstname.lastname@example.org. Also we invite you to follow us on Twitter @genestack.
- Clark M.J., et al. Performance comparison of exome DNA sequencing technologies. Nature biotechnology 2011; 29(10):908-914
- Ebersberger I., et al. Genomewide comparison of DNA sequences between humans and chimpanzees. The American Journal of Human Genetics 2002; 70:1490–1497
- Mills R.E., et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Research 2006; 16:1182–1190
- van Dijk E.L., et al. Ten years of next-generation sequencing technology. Trends in Genetics 2014; 30:418-426