Whole-genome bisulfite sequencing data analysis

DNA methylation

Adapted from Lorenzen J.M., Martino F., Thum T. Epigenetic modifications in cardiovascular disease. Basic Res Cardiol 107:245 (2012)

Investigating the DNA methylation pattern is extremely valuable because this epigenetic system regulates gene expression. It is involved in embryonic development, genomic imprinting, X chromosome inactivation and cell differentiation. Since methylation takes part in many normal cellular functions, aberrant methylation of DNA may be associated with a wide spectrum of human diseases, including cancer. A strong interplay between methylation patterns and other epigenetic systems provides a unique picture of active and supressed genes. Indeed, methylomes of a malignant cell and a healthy one are different. Moreover, both hyper- and hypomethylation events are associated with cancer.

As methylation patterns of DNA are reversible and can be modified under the influence of the environment, epigenetics opens new possibilities in diagnosis and therapy of cancers and other severe diseases.

Bisulfite sequencing approaches are currently considered a “gold standard” allowing comprehensive understanding of the methylome landscape. Whole-genome bisulfite sequencing (WGBS), provides single- base resolution of methylated cytosines accross the entire genome.

In this tutorial step-by-step we will show you how to analyse and interprete bisulfite sequencing data with Genestack Platform.

The overall pipeline includes the following steps:

  1. Setting up a WGBS experiment
  2. Quality control of bisulfite sequencing reads
  3. Preprocessing of the raw reads: trimming adaptors, contaminants and low quality bases
  4. Bisulfite sequencing mapping of the preprocessed reads onto a reference genome
  5. Merging the mapped reads
  6. Quality control of the mapped reads
  7. Methylation ratio analysis
  8. Exploring the genome methylation levels in Genome Browser

The details will be further elaborated in the sections below. To follow along go to Tutorials folder in the Public data. Then select the Whole-Genome Bisulfite Sequencing Data Analysis on Genestack Platform folder, containing all the tutorial files we talk about here for your convenience. Find there processed files, explore results, and repeat the analysis steps on data of your interest with a WGBS data analysis (for Rodriguez et al., 2014) dataflow.

Setting up a WGBS experiment

For this tutorial we picked the data set by Rodriguez et al., 2014 from the Genestack collection of Public Experiments . Feel free to reproduce the workflow on any other relevant data set that you can find with Data Browser. If you do not find there needed experiment or you intend to analyse your own data use our Import Data application allowing to upload files from your computer or from URL.

In this experiment authors applied WGBS of genomic DNA to investigate the mechanisms that could promote changes in DNA methylation and contribute to cell differentiation and malignant transformation. They investigated the cytosine methylation profile in normal precursors of leukemia cells, hematopoietic stem cells (HSCs).

The team discovered novel genomic features, they called them DNA methylation canyons, that are uncommonly large DNA regions of low methylation. Canyons are distinct from CpG islands and associated with genes involved in development.

In mammals, a methyl group is added to cytosine residues by DNA methyltransferase (DNMT) enzymes. As the de novo DNA methyltransferase Dnmt3a is shown to be crucial for normal HSCs differentiation, and Dnmt3a gene is often mutated in human leukemias, the authors further explore how loss of Dnmt3a influences canyons.


Adapted from Jeong M. & Goodell M.A. New answers to old questions from genome-wide maps of DNA methylation in hematopoietic cells. Exp Hematol 42(8):609-617

They compared DNA methylation patterns in wild type and Dnmt3a knockout mouse HSCs. It turned out that the loss of Dnmt3a results in methylation changes at the edge of canyons and can influence canyon size.

Now let’s start reproducing these results with data flows pre-prepared by Genestack.

To learn more just open the experiment in Metainfo Editor:


Quality control of bisulfite sequencing reads

Quality control of raw reads is an essential step in our pipeline to ensure that the data is of high quality and is suitable for further analysis. The raw data may be contaminated with PCR primers and adapter dimers or it can contain bases of poor quality. Moreover, the bisulfite sequencing protocol utilizes sodium bisulfite conversion of unmethylated cytosines to thymidines and that reduces the sequence complexity resulting in errors during alignment.

With our platform you can quickly and easily obtain an interactive FastQC Report that covers different quality aspects of raw data such as general sequence quality, total number of reads, GC content distribution, read quality distribution and much more. To start the analysis we can use public Raw Reads Quality Control data flow. There is no need to repeat the same steps for each sample — the pipelines for all 16 assays from our experiment can easily be built simultaneously using this public data flow.

In the Data Flow Runner page you can see 16 raw reads samples from the experiment.


To start computation click on the button Run Data Flow and create resulting files. You will be suggested to initialize the created files at once or delay it till later.


The computation will take a while. You can track the progress of the report generation using our Task Manager that can be found at the top of the page. All created QC reports are located in the folder “Created files”. You can explore each report in FastQC Report app or compare QC-statistics for several samples using Multiple QC Report app.


If you decide to delay initialization, you can generate statistical reports for all the chosen samples directly on Multiple QC Report app page or from the Data Flow Runner page by clicking on ”16 files” and selecting “Start initialization”. Thereby, with just one click, the process will begin for all the samples we have selected.


We will demonstrate the examples of QC-reports using files previously prepared by our team. Go to the folder with processed files for our tutorial and click on the complete  Multiple QC Report for Raw Reads we have created for all 16 raw read files in order to compare the quality of our raw reads.


Looking at the plot we can see the number of nucleotides counted for each individual sample obtained from Dnmt3a-KO (blue) or WT (red) HSCs samples. Additionally, on the app page you could specify the statistics and metainfo which will be displayed on the plot and sort the samples by specific QC statistics or metainfo keys of choice.

Now let’s look at the  FastQC report for one of the assays, for example, “ko3a_b2l4 Bisulfite-Seq”. Per sequence GC content graph shows the GC content across the whole length of each read. Ideally, we will see the normal distribution of GC content. Our results reflect some deviation from from normal distribution: unusual sharp shape of the central peak may indicate the presence of contaminants in our library, for example adaptor dimers.


On the Per base sequence quality plots we can see that all bases in our sequence have the quality score equal or more than 30, which corresponds to 99.9 % base calling accuracy. The quality is degraded in the last bases, but it is an expected behaviour corresponding to the sequencing chemistry.


Per sequence quality score graph shows an average quality distribution over the set of sequences. It will help us see if there are any problems with sequencing run, for example a significant proportion of low quality sequences can be a signal of a systematic problem. In our case the overwhelming majority of reads are of a high quality (more than 30).


Let’s move on to the Per base sequence content graphs. The fact that our data failed this metric indicates that the base distribution is not uniform, namely the difference between A and T, or G and C is greater than 20 %. Indeed, we can see fluctuations in base compositions over the entire read length. This should not alarm us, because bisulfite treatment converts the most of the cytosines to thymines and that obviously affects the base composition. Looking at the plot we can see that the number of thymines is approximately 50 %, while cytosines are almost absent.


Sequence duplication levels metric allows us to assess the duplication level as well as the number of sequences that are not unique in the raw data. According to the plot, we have more than 30 % of non-unique sequences of the total in the assay. Such a high duplication level can be linked to PCR artefacts, contaminants or sequencing of the same area several times.


The application also detects Overrepresented sequences  that may correspond to primer or adapter contamination. Indeed, in our case two over-represented sequences were found in our assay. Here they are:


These contaminants can strongly influence the results of analysis and should be trimmed.

All prepared FastQC reports for all the samples are stored in the FastQC reports for Rodriguez et al., 2014 folder.

Preprocessing of raw reads

After checking the quality of our data,  we can proceed with appropriate steps for improving the original raw data in order to get reliable results in the downstream analysis.

The authors analysed two biological replicates for two murine phenotypes: wild type (WT) HSCs and conditionally Dnmt3a knocked out (KO) HSCs. Moreover, each biological replicate of WT or  Dnmt3a-null HSCs condition has several technical replicates. Let’s select the raw reads “m12_b4l1 Bisulfite-Seq”, “m12_b4l2 Bisulfite-Seq” and “m12_b3 Bisulfite-Seq” that are three technical replicates for the second biological replicate of WT HSCs from our experiment and right click on them. The “Data Flow for WGBS data analysis (for Rodriguez et al., 2014)” public data flow created for this tutorial is freely available on the platform, you can find it in the tutorial folder. To explore the data flow in more detail you can use Data flow runner app, where all the steps of our pipeline are schematically represented.


In the first block you will see the source files we have just selected. Also you need to specify reference genome onto which our reads will be mapped. So Choose sources, find appropriate murine reference genome and Select.


Let’s run data flow by click on the corresponding button and take a closer look at all the steps of our pipeline. As we will describe below, we will run this data flow several times to obtain methylation ratios for biological replicates of the two tested phenotypes separately. The first part of our pipeline is preprocessing of raw sequencing data. Based on the QC statistics we highly recommend you to  remove adapters and contaminants, trim low quality bases and remove duplicates. And we also remove duplicates during Methylation Ratio Analysis, but you can also use a separate preprocess application  Remove Duplicated Reads. Firstly, we can easily remove the found overrepresented sequences from WGBS data using Trim adapters and contaminants app.

Later, to avoid mismatches in read mapping, we should remove low quality bases from the sequencing reads. Trim low quality bases application allows you to get rid of nucleotide bases with a low phred33 quality which corresponds to an error threshold equal to 1 %.

All preprocessed files are freely accessible in the folders Trim adaptors for Rodriguez et al., 2014 and Trim low quality bases for Rodriguez et al., 2014.

Mapping reads onto a reference genome

Following the preprocessing, our data is of improved quality and we can move on to the next step — alignment of trimmed reads onto the reference genome.

We run Bisulfite Sequencing Mapping with default parameters. Click on the app name to move to its page where you can change the parameters of alignment and learn more about the app clicking on its name.

In the Mapped reads for Rodriguez et al., 2014 folder you can find all the Mapped Reads.

Merging of the mapped reads obtained from technical replicates

It is essential for any experiment to create a good and adequate control. Using replicates can improve quality and precision of your experiment especially when comparing several experimental conditions. There are two types of replicates: technical and biological. When the same biological sample is prepared and sequenced separately, this gives us technical replicates. But if different biological samples are treated the same way but prepared separately, we are talking about biological replicates. If there are no changes in sample preparation, technical replicates could be considered as controls for the samples under the same experimental conditions. As we remember, our experiment contains both biological and technical replicates for 12-month-old wild-type HSCs and  Dnmt3a-KO HSCs. As authors do not mention using different experimental conditions for technical replicates, we can merge them before the calculation of methylation ratios. We will not merge biological replicates, because significant biological variability may be present between two sample growing separately.

For each biological replicate create a subset including technical replicates that will be merged. As a result, you will get 4 merged mapped reads for both analysed murine phenotypes. You can also use prepared Merged Mapped Reads by opening the Merged mapped reads for Rodriguez et al., 2014 folder.

Quality control of mapped reads

Before proceeding to the next analysis step we would recommend you to check the quality of mapped reads because the reads might look alright in raw reads quality control check but some issues, such as low coverage, homopolymer biases, experimental artifacts, only appears after alignment.

Sequencing coverage is an important quality metric, as biases in sample preparation, sequencing and read mapping can result in genome regions that lack coverage or in regions with much higher coverage than theoretically expected. You could explore the coverage for merged mapped reads in our Genome Browser. Go to the folder created, find merged mapped reads file we have created and click on the name of the merged mapped reads files, go to explore option and select the Genome Browser. Also go to our tutorial folder, find there Coverage for merged mapped reads for Rodriguez et al., 2014 file and open it in Genome browser app to explore results for initialized mapped reads.

In Genome Browser you have the option of viewing the region according to its coordinates or alternatively, you can search for a given gene or transcript name. Let’s check the coverage in the location of the large methylation-depleted region associated with Pax6 gene.


High-quality reads should have uniform and high enough coverage. Our last three samples are very good for the further processing. At the same time, the coverage for the first sample— biological replicate 2 of 12 HSCs — is uneven — with gaps and regions with coverage up to x133. That is why, such results may generate biases during methylation ratios calculation.


You can explore mapping quality for each individual file: right click merged reads file name, go to explore and select Mapped reads QC report. On the opened app page you should click generate reports to start calculation of quality control statistics. To check the quality of all mapped reads or merged mapped reads simultaneously use the Mapped Reads Quality Control public data flow as we done previously for Raw Reads Quality Control. Mapped reads QC report describes results of alignment of preprocessed reads onto the reference genome, and for example for paired-end reads it contains some mapping and insert size statistics, coverage  by chromosome plot and insert size distribution. When the computation is finished, you can open Mapped reads QC report for each individual file or explore the mapping statistics for all of them using Multiple QC Plotter. In order to do that find all of the merged find the generated mapped reads QC-reports, select all the four, and with context menu open all of them in Multiple QC plotter. Below you can see  QC-report for merged mapped reads generated by our team:


As we can see,  at most 6 % of the reads are improperly mapped out of around 65 % of mapped reads among all the samples. Remember that you can explore QC reports not only for merged mapped reads but also for mapped reads. All the prepared QC reports are located in the folders Mapped reads QC for Rodriguez et al., 2014 and Merged Mapped Reads QC for Rodriguez et al., 2014.

Methylation ratio analysis

When sequencing reads for both murine phenotypes are mapped onto reference genome with high enough quality, we can move on to the very last step in our pipeline — determining DNA methylation status of every single cytosine position on the both strands. In order to do that, go back to the Data Flow Runner page. Click on the Methylation Ratio Analysis to go to the app page where you can see source files and command line options that could be easily changed.

Then return to the data flow click on action, add files, chose the remaining merged mapped reads, and start initialization. During this step we apply several options to remove technical biases in WGBS data:

  1. Trim N end-repairing fill-in bases set to “3”. This option allows to trim 3 bases from the read end to remove the DNA overhangs created during read end-repair in library preparation. It is important because this end repair procedure may introduce artefacts if the repaired bases contain methylated cytosines.
  2. Report only unique mappings
  3. Discard duplicated reads option to remove duplicated reads which have identical sequences and could be the result of library preparation. These reads could be mapped to the same position and distort results of downstream analysis.

The folder Methylation Ratios for Rodriguez et al., 2014 contains all the resulting files of methylation ratios estimation.

Exploring the genome methylation levels in Genome browser

We can explore the distribution of genome methylation levels counted for both murine phenotypes in Genome Browser.

As it was mentioned before, “Canyons” are the large unmethylated DNA regions inside of a highly methylated locus that often harbour genes, such as Hoxa9 and Meis1, playing a role in hematopoiesis and being deregulated in human leukemia.

Some Canyons can be exceptionally large, for example one associated with the Pax6 homeobox gene encoding a homeobox-containing protein regulating transcription is extended over 25 kb:


Let’s compare our methylation ratios distribution in these region with author’s results:


To further examine the impact of Dnmt3a loss on the Canyon size, authors compared low-methylated DNA regions in HSCs with conditional inactivation of Dnmt3a gene to those in the WT cells:


This investigation revealed that methylation loss in Dnmt3a KO HSCs leads to the formation of new Canyons. Lack of Dnmt3a does not affect regions inside Canyons but it results in changes of Canyon edges. Boundaries of canyons became hotspots of differential methylation: Canyon size can be decreased due to hypermethylation or increased due to hypomethylation.

Moreover, at DNA regions containing cluster of Canyons in WT HSCs, larger Canyons (“Grand Canyons”) can be formed. We can see it on the example of HoxB regions in which Canyons are interrupted by short stretches of higher methylation.

All these findings suggest that Dnmt3a can be crucial for maintaining methylation in the Canyon boundaries.

Now, let’s take a look at the original track for the same Canyon cluster to compare the results:


This experiment is a part of the large research of changes in DNA methylation profile including different methodologies such as, for example, whole genome bisulfite sequencing and CMS-seq to reveal genome-wide distribution of mCs and hmCs, RNA-Seq to analyse expression of Canyon-associated genes. This incredible work was turned into a research paper, and the data sets can be found in our Public Experiments.

This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most out of our platform. If you have any questions and comments, feel free to email us at support@genestack.com. Also we invite you to follow us on Twitter @genestack.


  • Challen G.A. et al. Dnmt3a is essential for hematopoietic stem cell differentiation. Nat Genet. 44:23–31 (2012)
  • De Carvalho D.D. et al. DNA Methylation Screening Identifies Driver Epigenetic Events of Cancer Cell Survival. Cancer Cell 21(5):655-667 (2012)
  • Ehrlich M. DNA methylation in cancer: too much, but also too little. Oncogene 21:5400-5413 (2002)
  • Jeong N. et al. Large conserved domains of low DNA methylation maintained by Dnmt3a. Nat Genet 46(1):17–23 (2014)
  • Jeong M. & Goodell M.A. New answers to old questions from genome-wide maps of DNA methylation in hematopoietic cells. Exp Hematol 42(8):609-617
  • Kulis M., Esteller M. DNA methylation and cancer. Adv Genet 70:27-56 (2010)
  • Ley T.J. et al. DNMT3A mutations in acute myeloid leukemia. N Engl J Med. 363:2424–2433 (2010)