Testing differential gene expression¶
One of the most widespread applications of RNA-Seq technology is differential gene expression (DGE) analysis. By how gene expression levels change across different experimental conditions, we can gain clues about gene function and learn how genes work together to carry out biological processes.
In this tutorial we will use Genestack applications to identify differentially expressed (DE) genes and further annotate them according to biological process, molecular function, and cellular component.
The whole analysis includes the following steps:
- Setting up an RNA-Seq experiment
- Quality control of raw reads
- Preprocessing of raw reads
- Mapping RNA-Seq reads onto a reference genome
- Quality control of mapped reads
- Calculate read coverage for genes
- Differential gene expression analysis
Let’s deal with these steps one by one.
Setting up an RNA-seq experiment¶
The first step is to choose RNA-Seq dataset. There are more than 110.000 datasets (see the Public experiments folder) imported from various biological databases and public repositories (e.g. ArrayExpress, GEO, SRA, and ENA). All these data are absolutely free to use. The Data Browser app allows you to explore already existing RNA-Seq data. Data Browser provides filters that help to find the most appropriate data, that you can use it further in the analysis. Besides searching through all the data available on the platform, you can also upload your own data with the Import app.
Feel free to find all the data for this tutorial in the folder Testing Differential Gene Expression on Genestack Platform. To find it go to File Manager app and open the “Tutorials” folder located in the Public Data.
The authors investigated the transcriptional signature of Down syndrome (trisomy 21) during development, and analysed mRNA of induced pluripotent stem cells (iPSCs) derived from fetal fibroblasts of monozygotic twins discordant for trisomy 21: three replicates from iPSCs carrying the trisomy, and four replicates from normal iPSCs. They identified down-regulated genes expressed in trisomic samples and involved in multiple developmental processes, specifically in nervous system development. Genes up-regulated in Twin-DS-iPSCs are mostly related to the regulation of transcription and different metabolic processes.
To reproduce these results, we will use Differential Gene Expression Analysis data flow. But before let’s check the quality of raw reads to decide whether we should improve it or not.
Quality control of raw reads¶
Raw sequencing reads can include PCR primers, adaptors, low quality bases, duplicates and other contaminants coming from the experimental protocols. That is why we recommend you to check the quality of your raw data looking at such aspects as GC percentage, per base sequence quality scores, and other quality stаtistics. The easiest way to do this is to run the Raw Reads QC data flow:
After generating reports, you will be able to review various statistics and plots for each sample.
Per sequence GC content. In a random library you expect a roughly normal GC content distribution. An unusually shaped or shifted distribution could indicate a contamination.
Per base sequence quality plot depicts the range of quality scores for each position in the reads. A good sample will have qualities all above 28:
Per sequence quality scores plot shows the frequencies of quality scores in a sample. If the reads are of good quality, the peak on the plot should be shifted to the right as far as possible. In our case, the majority of Illumina reads have a good quality - in the range from 35 to 40 (the best score is 41):
Per base sequence content shows the four nucleotides’ proportions for each position. In a random library you expect no nucleotide bias and the lines should be almost parallel with each other:
There is a bias at the beginning of the reads, which is common for RNA-Seq data. This occurs during RNA-Seq library preparation, when “random” primers are annealed to the start of sequences. These primers are not truly random, and it leads to a variation at the beginning of the reads.
Sequence duplication levels plots represent the proportion of the library made up of sequences with different duplication levels. Sequences with 1, 2, 3, 4, etc duplicates are grouped to give the overall duplication level.
Looking at these plots, you may notice 15 % of sequences are duplicated more than 10 times, 6 % of sequences are repeated more than 100 times, etc. The overall rate of duplication is about 40 %. Nevertheless, while analysing transcriptome sequencing data, we should not remove these duplicates because we do not know whether they represent PCR duplicates or high gene expression of our samples.
We have run QC on all the data in the experiment and collected reports in Raw reads QC reports for Hibaoui et al (2013) folder.
Preprocessing of raw reads¶
QC reports not only provide you with the information on the data quality but can also help you to decide how to preprocess the data in order to improve its quality and get more reliable results in further analysis. There are various Genestack applications that allow you to do preprocessing, we will use Trim Adaptors and Contaminants app, as a first procedure of preprocessing.
Once the quality of raw data has been checked, we can go back to the main Differential Gene Expression Analysis data flow and choose sources: 7 raw reads from the tested dataset and a human reference genome. You can select the input data from the existing datasets or upload files directly into the data flow using Import Data.
After that, we run the data flow to create all the files.
All resulting files are collected in Trimmed raw reads for Hibaoui et al (2013) folder.
Mapping RNA-seq reads onto a reference genome¶
When all files were created, you can run the whole analysis here, choosing Expression Navigator for genes. But first, let’s align RNA-Seq reads to the reference genome across splice junctions and then explore mappings in Genome Browser.
Find Spliced Mapping step, click on “7 files”. In “Explore” section choose Genome Browser and start initialization there.
We run Spliced Mapping app with default parameters. To change them go to the app page and choose Edit parameters button. If you want to learn more about the app and its options, click on the app name and then on About application.
Find completed Mapped Reads files in Mapped reads files for Hibaoui et al (2013) folder. Let’s open some of them in Genome Browser to analyse reads coverage on chromosome 21 on the region chr21:30007376-40007694 (10 Mb):
Here is a combined track for all trisomic and control samples:
As you see, the majority of chr21 genes are indeed more expressed in the trisomic samples than in the euploid ones, which is consistent with the overall up-regulation of chr21 genes in individuals with Down syndrome.
Quality control of mapped reads¶
The optional step is to check how mapping went using Mapped Reads QC Report app. You can “generate reports” for each mapping separately or just run Mapped Reads Quality Control data flow for multiple samples:
Output report includes mapping statistics such as:
- Mapped reads: total reads which mapped to the reference genome;
- Unmapped reads: total reads which failed to map to the reference genome;
- Mapped reads with mapped mate: total paired reads where both mates were mapped;
- Mapped reads with partially mapped mate: total paired reads where only one mate was mapped;
- Mapped reads with “properly” mapped mate: total paired reads where both mates were mapped with the expected orientation;
- Mapped reads with “improperly” mapped mate: total paired reads where one of the mates was mapped with unexpected orientation.
The Coverage by chromosome plot shows the percentage of bases covered (y-axis) by at least N (x-axis) reads.
For paired reads, you can look at insert size statistics, such as median and mean insert sizes, median absolute deviation and standard deviation of insert size. The Insert size distribution plot is generated:
We already prepared all QC reports for mapped reads and put them in Mapped reads QC reports for Hibaoui et al (2013) folder. You can open all of them in Multiple QC Report app to view mapping statistics interactively:
Overall, more than 80 % of reads are mapped. It includes properly and partially mate pairs. Less than 11 % of reads are unmapped among the samples. Additionally, you can sort your samples by QC statistics or metainfo values. Read more what the app does in our blog post about interactive sequencing quality control reports.
Calculate read coverage for genes¶
After mapping, we can count reads mapped to annotated features (genes, exons, etc.) running Quantify Raw Coverage in Genes app. To run the app, click on “7 files” and then “Start initialization”.
For our analysis we counted reads mapping within exons, grouped them by gene_id and assigned reads to all exons they overlap with. We calculated read coverage in all samples and collected resulting files in Raw gene counts for Hibaoui et al (2013) folder.
Differential gene expression analysis¶
The final step is to identify genes whose patterns of expression differ according to different experimental conditions. In this tutorial, we are looking for variation in gene expression for trisomic samples compared to the control ones.
Open Expression Navigator file, re-group samples and start the analysis:
We prepared two Differential Expression Statistics files (considering the DE genes reported by both packages) and stored them in Differential gene expression analysis for Hibaoui et al (2013) folder. As an example, let’s analyse DE genes reported by DESeq2 package. You can see the table with top genes that are differentially expressed in one particular group compared to the average of the other groups. The table shows the corresponding Log FC (log fold change), Log CPM (log counts per million), p-value, and FDR (false discovery rate) for each gene. Genes with positive Log FC are considered to be up-regulated in the selected group, ones with negative Log FC are down-regulated. In the “Trisomy 21” group we identified 4426 low expressed genes (NR2F1, XIST, NEFM, etc.) and 4368 highly over-expressed genes (ZNF518A, MYH14, etc.). By selecting the checkbox next to a gene, more detailed information about that gene is displayed: its Ensembl identifier, description and location.
There are several options to filter/sort the genes displayed on the “Top Differentially Expressed Genes” table. You can filter them by minimum Log FC, minimum CPM and regulation type. By default, the genes are ranked by their FDR.
Let’s find genes that are most over-expressed in the “Trisomy 21” group by lowering the Max P-value threshold and increasing the Min LogFC and Min LogCPM thresholds. Change Regulation to “Up”, set both Min LogFC and Min LogCPM equal to ‘2’ and apply sorting by LogFC. As consistent with paper results, there is a number of zinc finger protein genes that are up-regulated in Twin-DS-iPSCs:
Interactive counts graph shows gene normalised counts across samples. This allows you to observe how a gene’s expression level varies within and across groups. Select several genes to compare expression level distributions between them:
If you move cursor to the top right corner of the graph, three icons will appear:
- Filter icon lets you filter the graph data by samples, groups, features, etc.
- Data icon will display all the data contained in the graph, and allow you to save the table data locally.
- Camera icon lets you save the displayed graph locally. Add labels to the graph and change its appearance by modifying the parameters on display when you right-click the graph area.
This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most out of our platform. If you have any questions and comments, feel free to email us at firstname.lastname@example.org. Also we invite you to follow us on Twitter @genestack.