Testing differential gene expression

One of the most widespread applications of RNA-Seq technology is differential gene expression (DGE) analysis. By how gene expression levels change across different experimental conditions, we can gain clues about gene function and learn how genes work together to carry out biological processes.

In this tutorial we will use Genestack applications to identify differentially expressed (DE) genes and further annotate them according to biological process, molecular function, and cellular component.

The whole analysis includes the following steps:

  1. Setting up an RNA-Seq experiment
  2. Quality control of raw reads
  3. Preprocessing of raw reads
  4. Mapping RNA-Seq reads onto a reference genome
  5. Quality control of mapped reads
  6. Calculate read coverage for genes
  7. Differential gene expression analysis

Let’s deal with these steps one by one.

Setting up an RNA-seq experiment

The first step is to choose RNA-Seq dataset. There are more than 110.000 datasets (see the Public experiments folder) imported from various biological databases and public repositories (e.g. ArrayExpress, GEO, SRA, and ENA). All these data are absolutely free to use. The Data Browser app allows you to explore already existing RNA-Seq data. Data Browser provides filters that help to find the most appropriate data, that you can use it further in the analysis. Besides searching through all the data available on the platform, you can also upload your own data with the Import app.

Feel free to find all the data for this tutorial in the folder Testing Differential Gene Expression on Genestack Platform. To find it go to File Manager app and open the “Tutorials” folder located in the Public Data.

The RNA-Seq experiment we will use comes from Hibaoui et al. 2013 and is publicly available on Genestack. Open it in Metainfo Editor to see more details:

../../_images/DGE_metainfo.png

The authors investigated the transcriptional signature of Down syndrome (trisomy 21) during development, and analysed mRNA of induced pluripotent stem cells (iPSCs) derived from fetal fibroblasts of monozygotic twins discordant for trisomy 21: three replicates from iPSCs carrying the trisomy, and four replicates from normal iPSCs. They identified down-regulated genes expressed in trisomic samples and involved in multiple developmental processes, specifically in nervous system development. Genes up-regulated in Twin-DS-iPSCs are mostly related to the regulation of transcription and different metabolic processes.

To reproduce these results, we will use Differential Gene Expression Analysis data flow. But before let’s check the quality of raw reads to decide whether we should improve it or not.

Quality control of raw reads

Raw sequencing reads can include PCR primers, adaptors, low quality bases, duplicates and other contaminants coming from the experimental protocols. That is why we recommend you to check the quality of your raw data looking at such aspects as GC percentage, per base sequence quality scores, and other quality stаtistics. The easiest way to do this is to run the Raw Reads QC data flow:

After generating reports, you will be able to review various statistics and plots for each sample.

Per sequence GC content. In a random library you expect a roughly normal GC content distribution. An unusually shaped or shifted distribution could indicate a contamination.

../../_images/DGE_per_seq_GC_cont.png

Per base sequence quality plot depicts the range of quality scores for each position in the reads. A good sample will have qualities all above 28:

../../_images/DGE_per_base_s.png

Per sequence quality scores plot shows the frequencies of quality scores in a sample. If the reads are of good quality, the peak on the plot should be shifted to the right as far as possible. In our case, the majority of Illumina reads have a good quality - in the range from 35 to 40 (the best score is 41):

../../_images/DGE_per_seq_quality_sco.png

Per base sequence content shows the four nucleotides’ proportions for each position. In a random library you expect no nucleotide bias and the lines should be almost parallel with each other:

../../_images/DGE_per_base_seq_cont.png

There is a bias at the beginning of the reads, which is common for RNA-Seq data. This occurs during RNA-Seq library preparation, when “random” primers are annealed to the start of sequences. These primers are not truly random, and it leads to a variation at the  beginning of the reads.

Sequence duplication levels plots represent the proportion of the library made up of sequences with different duplication levels. Sequences with 1, 2, 3, 4, etc duplicates are grouped to give the overall duplication level.

../../_images/DGE_seq_dupl_lev.png

Looking at these plots, you may notice 15 % of sequences are duplicated more than 10 times, 6 % of sequences are repeated more than 100 times, etc. The overall rate of  duplication is about 40 %. Nevertheless, while analysing transcriptome sequencing data, we should not remove these duplicates because we do not know whether they represent PCR duplicates or high gene expression of our samples.

We have run QC on all the data in the experiment and collected reports in Raw reads QC reports for Hibaoui et al (2013) folder.

Preprocessing of raw reads

QC reports not only provide you with the information on the data quality but can also help you to decide how to preprocess the data in order to improve its quality and get more reliable results in further analysis. There are various Genestack applications that allow you to do preprocessing, we will use Trim Adaptors and Contaminants app, as a first procedure of preprocessing.

../../_images/DGE_preprocess_apps.png

Once the quality of raw data has been checked, we can go back to the main Differential Gene Expression Analysis data flow and choose sources: 7 raw reads from the tested dataset and a human reference genome. You can select the input data from the existing datasets or upload files directly into the data flow using Import Data.

../../_images/DGE_data_flow_1.png

After that, we run the data flow to create all the files.

../../_images/DGE_data_flow_2.png

All resulting files are collected in Trimmed raw reads for Hibaoui et al (2013) folder.

Mapping RNA-seq reads onto a reference genome

When all files were created, you can run the whole analysis here, choosing Expression Navigator for genes. But first, let’s align RNA-Seq reads to the reference genome across splice junctions and then explore mappings in Genome Browser.

Find Spliced Mapping step, click on “7 files”. In “Explore” section choose Genome Browser and start initialization there.

We run Spliced Mapping app with default parameters. To change them go to the app page and choose Edit parameters button. If you want to learn more about the app and its options, click on the app name and then on About application.

../../_images/DGE_spl_map.png

Find completed Mapped Reads files in Mapped reads files for Hibaoui et al (2013) folder. Let’s open some of them in Genome Browser to analyse reads coverage on chromosome 21 on the region chr21:30007376-40007694 (10 Mb):

../../_images/DGE_coverage_21.png

Here is a combined track for all trisomic and control samples:

../../_images/DGE_GB_combined_track.png

As you see, the majority of chr21 genes are indeed more expressed in the trisomic samples than in the euploid ones, which is consistent with the overall up-regulation of chr21 genes in individuals with Down syndrome.

Quality control of mapped reads

The optional step is to check how mapping went using Mapped Reads QC Report app. You can “generate reports” for each mapping separately or just run Mapped Reads Quality Control data flow for multiple samples:

Output report includes mapping statistics such as:

  1. Mapped reads: total reads which mapped to the reference genome;
  2. Unmapped reads: total reads which failed to map to the reference genome;
  3. Mapped reads with mapped mate: total paired reads where both mates were mapped;
  4. Mapped reads with partially mapped mate: total paired reads where only one mate was mapped;
  5. Mapped reads with “properly” mapped mate: total paired reads where both mates were mapped with the expected orientation;
  6. Mapped reads with “improperly” mapped mate: total paired reads where one of the mates was mapped with unexpected orientation.

The Coverage by chromosome plot shows the percentage of bases covered (y-axis) by at least N (x-axis) reads.

../../_images/Coverage_by_chromosome.png

For paired reads, you can look at insert size statistics, such as median and mean insert sizes, median absolute deviation and standard deviation of insert size. The Insert size distribution plot is generated:

../../_images/Insert_size_distribution.png

We already prepared all QC reports for mapped reads and put them in Mapped reads QC reports for Hibaoui et al (2013) folder. You can open all of them in Multiple QC Report app to view mapping statistics interactively:

../../_images/DGE_multiple_qc_plotter.png

Overall, more than 80 % of reads are mapped. It includes properly and partially mate pairs. Less than 11 % of reads are unmapped among the samples. Additionally, you can sort your samples by QC statistics or metainfo values. Read more what the app does in our blog post about interactive sequencing quality control reports.

Calculate read coverage for genes

After mapping, we can count reads mapped to annotated features (genes, exons, etc.) running Quantify Raw Coverage in Genes app. To run the app, click on “7 files” and then “Start initialization”.

../../_images/rpkm-counts-start-initialization.png

For our analysis we counted reads mapping within exons, grouped them by gene_id and assigned reads to all exons they overlap with. We calculated read coverage in all samples and collected resulting files in Raw gene counts for Hibaoui et al (2013) folder.

Differential gene expression analysis

The final step is to identify genes whose patterns of expression differ according to different experimental conditions. In this tutorial, we are looking for variation in gene expression for trisomic samples compared to the control ones.

Open Expression Navigator file, re-group samples and start the analysis:

We prepared two Differential Expression Statistics files (considering the DE genes reported by both packages) and stored them in Differential gene expression analysis for Hibaoui et al (2013) folder. As an example, let’s analyse DE genes reported by DESeq2 package. You can see the table with top genes that are differentially expressed in one particular group compared to the average of the other groups. The table shows the corresponding Log FC (log fold change), Log CPM (log counts per million), p-value, and FDR (false discovery rate) for each gene. Genes with positive Log FC are considered to be up-regulated in the selected group, ones with negative Log FC are down-regulated. In the “Trisomy 21” group we identified 4426 low expressed genes (NR2F1, XIST, NEFM, etc.) and 4368 highly over-expressed genes (ZNF518A, MYH14, etc.). By selecting the checkbox next to a gene, more detailed information about that gene is displayed: its Ensembl identifier, description and location.

../../_images/DGE_DGE_table.png

There are several options to filter/sort the genes displayed on the “Top Differentially Expressed Genes” table. You can filter them by minimum Log FC, minimum CPM and regulation type. By default, the genes are ranked by their FDR.

tutorials/DGE_analysis/images/filters.png

Let’s find genes that are most over-expressed in the “Trisomy 21” group by lowering the Max P-value threshold and increasing the Min LogFC and Min LogCPM thresholds. Change Regulation to “Up”, set both Min LogFC and Min LogCPM equal to ‘2’ and apply sorting by LogFC. As consistent with paper results, there is a number of zinc finger protein genes that are up-regulated in Twin-DS-iPSCs:

../../_images/zinc_finger.png

Interactive counts graph shows gene normalised counts across samples. This allows you to observe how a gene’s expression level varies within and across groups. Select several genes to compare expression level distributions between them:

../../_images/zinc_DGE_plot.png

If you move cursor to the top right corner of the graph, three icons will appear:

  1. Filter icon lets you filter the graph data by samples, groups, features, etc.
  2. Data icon will display all the data contained in the graph, and allow you to save the table data locally.
  3. Camera icon lets you save the displayed graph locally. Add labels to the graph and change its appearance by modifying the parameters on display when you right-click the graph area.

This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most out of our platform. If you have any questions and comments, feel free to email us at support@genestack.com. Also we invite you to follow us on Twitter @genestack.