Applications review

Applications available on Genestack are grouped into four categories:

  • Preprocess applications cover tasks such as data prefiltering, subsampling or normalisation, which typically should be performed before getting into the “heavy-lifting” part of data analysis.
  • Analyse applications include key analysis steps like sequence alignment, variant calling, expression quantification, etc.
  • Explore contains all interactive graphical interface applications allowing users to view the results of their computations, such as applications for visualizing QC reports, Genome Browser, Variant Explorer etc.
  • Manage contains applications used to manage your data: applications dealing with data flows, file provenance, export, metadata editing and so on.

An extended version of each application’s description can be found in the “About application” text for that application.

To view this text for a specific application, click on the application’s name at the top-left corner of the page, and in the drop-down menu select “About application”.

../_images/about_app.png

Sequencing data

Raw reads quality control and preprocessing

Once you have raw sequencing data in Genestack, the next steps are to check the quality of the reads and to improve it if necessary. Let’s go through the applications developed to assess data quality and perform preprocessing.

FastQC report

Action: to perform quality control (QC) of raw sequencing reads. According to the “garbage in, garbage out” rule, if we begin our analysis with poor quality reads, we should not expect great results at the end. This is why QC is the essential first step of any analysis.

The FastQC Report application is based on the FastQC tool developed by Simon Andrews at the Babraham Institute.

The application generates a separate FastQC report for each sample in a tested dataset; you can find them in the “Created files” folder. You can also explore all of them simultaneously with the Multiple QC Report application. To do so, go to the “My datasets” folder, select the created “FastQC Report” dataset and open it with Multiple QC Report using the context menu.

A FastQC report contains various graphs that visualize the quality of your data. We will go through them one by one and tell you:

  1. How they should look for good-quality data;
  2. How they may look if there is something wrong with your data;
  3. What you can do if the quality is unsatisfactory.

The metrics table gives you quick indicators as to the status of each of the quality metrics calculated.

1. Basic statistics
../_images/fastqc_basic_statistics.png

Information on type and number of reads, GC content, and total sequence length.

2. Sequence length distribution
../_images/fastqc_sequence_length_distribution.png

Reports lengths of all sequences.

Warning: this report will get warnings if the lengths are not identical, but this can usually be ignored, as it is expected for some sequencing platforms.

3. Per sequence GC content
../_images/fastqc_per_sequence_gc_content.png

For data of good quality, the graph will show a normal, bell-shaped distribution.

Warning: when the sum of the deviations from the normal distribution represents more than 15% of the reads, the report will raise a warning.

Warnings are usually caused by the presence of contaminants. Sharp peaks may represent a very specific contaminant (e.g. an adaptor), while broader peaks may indicate contamination with a range of contaminants.

Improving data quality: run the Trim Adaptors and Contaminants preprocessing application.

4. Per base sequence quality
../_images/fastqc_per_base_sequence_quality.png

For data of good quality, the median quality score per base (Phred) should not drop below 20.

Failure: the report will get failures if the lower quartile for quality at any base position is less than 5 or if the median for any base is less than 20.
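
The failure rule above can be written as a simple check. This is a hypothetical helper that illustrates the rule, not FastQC's actual implementation (which also has intermediate warning thresholds not shown here):

```python
def per_base_quality_status(lower_quartiles, medians):
    """Apply the failure rule described above to per-position statistics:
    fail if the lower quartile at any base position is below 5, or the
    median at any position is below 20."""
    for lq, med in zip(lower_quartiles, medians):
        if lq < 5 or med < 20:
            return "fail"
    return "ok"

# A read set whose quality collapses at the 3' end fails the check
print(per_base_quality_status([30, 28, 4], [36, 34, 22]))  # fail
```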

Improving data quality: if the quality of the library falls to a low level over the course of a read, the standard solution is to trim low-quality bases or to omit low-quality reads. This can be done using the Trim Low Quality Bases or Filter by Quality Scores applications, respectively.

5. Per sequence quality scores
../_images/fastqc_per_sequence_quality_scores.png

Ideally, we expect to see a sharp peak at the very end of the graph (meaning most frequently observed mean quality scores are above 27).

Warning: the report will get warnings when the peak is shifted to the left, which means the most frequently observed mean quality is below 27. This corresponds to a 0.2% error rate.
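
The link between a quality score and an error rate comes from the Phred formula, error = 10^(-Q/10); a quick check in Python:

```python
def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

print(round(phred_to_error(27), 4))  # 0.002, the 0.2% error rate mentioned above
print(round(phred_to_error(20), 4))  # 0.01, i.e. a 1% error rate
```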

Improving data quality: perform quality-based trimming or selection using Trim Low Quality Bases or Filter by Quality Scores applications respectively.

6. Per base sequence content
../_images/fastqc_per_base_sequence_content.png

Ideally, in a random library we would see four parallel lines representing the relative base composition. Fluctuations at the beginning of reads in the tested sample may be caused by adapter sequences or other contaminations of the library.

A bias at the beginning of the reads is common for RNA-seq data. This occurs during RNA-seq library preparation, when “random” primers are annealed to the start of sequences. These primers are not truly random, which leads to variation at the beginning of the reads.

Warning: a warning will be raised if the difference between A and T, or G and C is greater than 10% at any position.

Improving data quality: if there is instability at the start of the read, the consensus is that no action is necessary. If variation appears over the course of a read, the Trim Reads to Fixed Length application may be used. If there is persistent variation throughout the read, it may be best to discard it. Some datasets may trigger a warning due to the nature of the sequence: for example, bisulfite sequencing data will have almost no cytosines, and some species are unusually GC rich or poor and may therefore also trigger a warning.

7. Sequence duplication levels
../_images/fastqc_sequence_duplication_levels.png

Reports total number of reads, number of distinct reads and mean duplication rates.

Warning: this module will issue a warning if non-unique sequences make up more than 20% of the total.

There are two potential types of duplicates in a library: technical duplicates arising from PCR artifacts, and biological duplicates, which are natural collisions where different copies of exactly the same sequence are randomly selected. At the sequence level there is no way to distinguish between these two types, and both will be reported as duplicates here.
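
The 20% warning condition can be illustrated with a small sketch. This is a hypothetical helper, not FastQC's exact estimator (which works on a subsample of the library):

```python
from collections import Counter

def non_unique_fraction(sequences):
    """Fraction of reads whose sequence occurs more than once in the library."""
    counts = Counter(sequences)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(sequences)

reads = ["ACGT", "ACGT", "TTGA", "CCAG"]
print(non_unique_fraction(reads))  # 0.5, above 0.2, so this toy library would warn
```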

Improving data quality: if the observed duplications are due to primer/adaptor contamination, they can be removed using the Trim Adaptors and Contaminants application. The Filter Duplicated Reads application can also be used for DNA sequencing data, but it will distort expression data.

8. Overrepresented sequences
../_images/fastqc_overrepresented_sequences.png

Shows the highly overrepresented sequences (more than 0.1% of total sequence) in the sample.

Warning: if any sequence is found to represent more than 0.1% of the total, a warning will be raised.
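
The 0.1% rule could be checked like this. This is a toy sketch of the rule only; FastQC applies it to a subsample of the library rather than every read:

```python
from collections import Counter

def overrepresented_sequences(sequences, threshold=0.001):
    """Return sequences making up more than `threshold` (0.1%) of the library."""
    counts = Counter(sequences)
    total = len(sequences)
    return {seq: count / total
            for seq, count in counts.items() if count / total > threshold}

# 998 unique reads plus one adaptor sequence seen twice (0.2% of the total)
library = [f"read_{i}" for i in range(998)] + ["ADAPTOR"] * 2
print(overrepresented_sequences(library))  # {'ADAPTOR': 0.002}
```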

There are several possible sources of overrepresented sequences:

  • technical biases (one region was sequenced several times; PCR amplification biases);
  • a feature of library preparation (e.g. for targeted sequencing);
  • natural reasons (RNA-seq libraries can naturally present high duplication rates).

Overrepresented sequences should only worry you if you think they are present due to technical biases.

Improving data quality: procedures and caveats for improving data quality are the same as for sequence duplication level.

You can explore all the generated FastQC reports at the same time and on one page with Multiple QC report application. All the FastQC reports are kept together in a “FastQC Report” dataset in the “My Datasets” folder.

Multiple QC report

Action: to display metrics from multiple reports at once. It accepts as input a dataset of QC reports.

Select from a range of QC keys to display on the plot, e.g. Total nucleotide count (mate 1 and 2) or Number of reads (mate 1 and 2):

../_images/multiple_qc_report_qc_keys.png

You can select which metainfo to display in the plot labels:

../_images/multiple_qc_report_metinfo.png

Also, samples in the Multiple QC Report can be sorted by metainfo key or specified QC metric.

../_images/multiple_qc_report_sorting.png

Finally, you can highlight the interesting reports and put them in a separate folder (New folder with selection button).

../_images/multiple_qc_report_select_reports.png

When the quality of the raw reads is unsatisfactory, several preprocessing applications are available on the platform that can increase the quality of your raw reads. Here we will walk you through each one and give you a checklist to use when deciding which to select. After each of the preprocessing steps, you can use the FastQC Report application again to compare the quality pre- and post-processing (remember that in order to do this, you need to run a different computation, this time inputting processed data source files into the data flow).

Subsample reads

Action: to create a random subset of raw reads.

Command line options:

  1. The Random seed value will let you create different subsets with the same number of reads. (default: 100)
  2. The Number of reads in subset option tells the application how many reads you expect the output subsample will contain. (default: 50,000)

Using the same seed and the same number of reads will result in identical subsets.
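
The reproducibility claim can be illustrated with a Python analogy (Seqtk itself is a C tool; the function below is just a sketch of how a fixed seed makes subsampling reproducible):

```python
import random

def subsample_reads(reads, n, seed=100):
    """Draw a reproducible random subset of n reads."""
    return random.Random(seed).sample(reads, n)

reads = [f"read_{i}" for i in range(1000)]
same_a = subsample_reads(reads, 5, seed=100)
same_b = subsample_reads(reads, 5, seed=100)
print(same_a == same_b)  # True: identical seed and size give identical subsets
```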

This application is based on the Seqtk.

Filter duplicated reads

Action: to discard duplicated sequenced fragments from raw reads data. If a sequence of two paired reads or a single read occurs multiple times in a library, the output will include only one copy of that sequence.

The Phred quality scores are created by keeping the highest score across all identical reads for each position.
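
The quality-score merging described above can be sketched as follows (a hypothetical helper, not Tally's implementation):

```python
def collapse_duplicates(reads):
    """reads: list of (sequence, per-base Phred scores). Identical sequences
    are collapsed into a single read whose score at each position is the
    highest score observed across all copies."""
    merged = {}
    for seq, quals in reads:
        if seq in merged:
            merged[seq] = [max(a, b) for a, b in zip(merged[seq], quals)]
        else:
            merged[seq] = list(quals)
    return list(merged.items())

reads = [("ACGT", [30, 20, 30, 20]), ("ACGT", [10, 40, 10, 40])]
print(collapse_duplicates(reads))  # [('ACGT', [30, 40, 30, 40])]
```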

Use this application if you suspect contamination with primers or some other repetitive sequence. This should be evident from the “Sequence duplication levels” and the “Overrepresented sequences” modules of the FastQC report. Keep in mind that this application should not be used with RNA-seq data, as it will remove observed differences in expression level.

This tool is based on Tally.

Filter by quality scores

Action: to discard reads from a raw reads file based on Phred33 quality scores. The application classifies each read as pass or fail by calculating the quality score distribution across the read.

Command line options:

  1. Minimum quality score (Phred+33 range, 0…41) is a quality cutoff value. A score of 20 means that there is a 1% chance that the corresponding base was called incorrectly by the sequencer. A score of 30 means a 0.1% chance of an incorrect base call. (default: 20)
  2. Percentage of bases to be above the minimum quality score is the percentage of nucleotides in a read that must have quality equal to or higher than the chosen minimum quality score. 100% requires all bases in a read to meet the cut-off value; 50% requires the median base quality to be at least the cut-off value. (default: 80)

Let’s take an example to understand how the application works. So, here is our read:

../_images/filter_by_quality_scores_example.png

The second line represents the nucleotide sequence (10 bases in this case). The fourth line contains quality scores for each nucleotide in the read.

  • If the “Minimum quality score” is equal to 30 and the “Percentage of bases” is equal to 50, this read will not be discarded, because the median quality of the read is higher than 30.
  • If the “Minimum quality score” is equal to 20 and the “Percentage of bases” is equal to 100, the read will be discarded, because not all bases have quality equal to or higher than 20.
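
The pass/fail rule from this example can be sketched in Python (a hypothetical re-implementation of the rule, not fastq_quality_filter itself):

```python
def passes_quality_filter(quality_scores, min_quality=20, min_percent=80):
    """Keep a read only if at least `min_percent` % of its bases have a
    Phred score of `min_quality` or higher."""
    good = sum(1 for q in quality_scores if q >= min_quality)
    return 100 * good / len(quality_scores) >= min_percent

read_quals = [30, 30, 32, 35, 28, 31, 33, 15, 30, 34]  # one base below 20
print(passes_quality_filter(read_quals, min_quality=30, min_percent=50))   # True
print(passes_quality_filter(read_quals, min_quality=20, min_percent=100))  # False
```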

This tool is based on fastq_quality_filter, which is part of the FASTX-Toolkit.

This application is best used if you have some low quality reads, but others are of high quality. You should be able to tell if this is the case from the shape of the “Per sequence quality scores” plot from the FastQC application. It may also be worth trying this application if the per base sequence quality is low.

Trim adaptors and contaminants

Action: to find and trim adaptors and known contaminating sequences from raw reads data.

The application uses an internal list of sequences that can be considered as contaminants. This list is based on the possible primers and adaptors which the most popular sequencing technologies and platforms use. For instance, it contains widely used PCR primers and adaptors for Illumina, ABI etc. (see the list of primers and adaptors we remove).

The occurrence threshold before adaptor clipping is set to 0.0001. It refers to the minimum frequency with which an adaptor must occur before clipping is considered necessary.

Command line options:

Minimum length of the trimmed sequence (bp). The application will discard trimmed reads of length below this number. (default: 15)

This application is based on the fastq-mcf, one of the EA-Utils utilities.

The application is best used when you have irregularities in GC content, in base content at the start of reads, or duplicated reads. Since this application relies on sequence matching, it should be run first if used in conjunction with other QC applications.

Trim low quality bases

Action: to isolate high-quality regions from raw reads.

The Trim Low Quality Bases application is based on the Phred algorithm. It finds the longest subsequence of a read in which the estimated error rate is below the error threshold (0.01 by default).

To understand how the application works, let’s take an example. Imagine we have a sequence:

../_images/trim_low_quality_bases_example.png

The application will find the fragment of the read where the sum of all base error probabilities is no more than 0.01 (in our case). Here, the best sequence is “TAGA” (.001*2 + .0001*2 = .0022), and it will be the output read. Fragments extending beyond this region would have a sum of error probabilities greater than the 0.01 cutoff, so they are rejected.
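
The search described here can be sketched as a brute-force version in Python. The real Seqtk/Phred implementation is a faster modified Mott algorithm; this sketch just mirrors the description above:

```python
def phred_error(q):
    """Probability of an incorrect call for Phred score q."""
    return 10 ** (-q / 10)

def trim_low_quality(quality_scores, max_total_error=0.01):
    """Return (start, end) of the longest read fragment whose summed
    per-base error probabilities stay within max_total_error."""
    best_start, best_end, best_len = 0, 0, 0
    for i in range(len(quality_scores)):
        total = 0.0
        for j in range(i, len(quality_scores)):
            total += phred_error(quality_scores[j])
            if total > max_total_error:
                break
            if j - i + 1 > best_len:
                best_start, best_end, best_len = i, j + 1, j - i + 1
    return best_start, best_end

# Qualities 10,30,30,40,40,10: the middle four bases sum to
# .001*2 + .0001*2 = .0022, matching the "TAGA" example above
print(trim_low_quality([10, 30, 30, 40, 40, 10]))  # (1, 5)
```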

This tool is based on the Seqtk tool and uses the Phred algorithm to pick out the regions of the highest quality.

Trim reads to fixed length

Action: to trim a specific number of bases from the extremities of all reads in a sample.

Command line options:

  1. The Keep bases from position option asks you to specify the first base that should be kept. (default: 1)
  2. Keep bases to position (set to zero for entire read). Indicate the position of the last nucleotide that should be kept in the read. (default: 0)

For example, if you set 5 as the first base to keep and 30 as the last base to keep, it means that the application trims all nucleotides before the 5th position, and all nucleotides after the 30th base.
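
The two options map directly onto Python slicing (a hypothetical helper; fastx_trimmer operates on FASTQ files, not strings):

```python
def trim_to_fixed_length(read, keep_from=1, keep_to=0):
    """Keep bases from 1-based position keep_from through keep_to inclusive;
    keep_to=0 keeps everything through the end of the read."""
    end = len(read) if keep_to == 0 else keep_to
    return read[keep_from - 1:end]

read = "AACCGGTTAACCGGTT"
print(trim_to_fixed_length(read, keep_from=5, keep_to=12))  # GGTTAACC
print(trim_to_fixed_length(read, keep_from=3))              # CCGGTTAACCGGTT
```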

This tool is based on the fastx_trimmer, which is part of the FASTX-Toolkit.

Trim Reads to Fixed Length application is helpful when you want to obtain reads of a specific length (regardless of the quality).

Mapped reads quality control and preprocessing

If you are analysing mapped reads, we recommend checking whether any biases arose during the mapping process (e.g. low coverage, experimental artifacts, etc.) and preprocessing the mapped reads accordingly.

Mapped reads QC report

Action: to perform quality control (QC) of mapped reads.

We follow a similar procedure to the one used to generate FastQC reports. After selecting the mapped reads we wish to check the quality of, we can run the Mapped Reads QC public data flow.

An individual Mapped Reads QC report contains some technical information about source data, tools used and data flow.

Also, it includes a range of mapping statistics. For single reads, the following QC metrics are calculated:

  1. Total number of reads: how many reads were used in mapping to the reference genome;
  2. Unmapped reads: total number of reads which failed to map to the reference genome;
  3. Mapped reads: total number of reads aligned to the reference genome;
  4. Uniquely mapped reads: total number of reads aligned exactly 1 time to the reference genome;
  5. Multi-hit mapped reads: total number of reads aligned >1 times to the reference genome.
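
For single reads, these five metrics can be derived from per-read hit counts; below is a toy sketch (a hypothetical data representation; the actual report is computed from the BAM file):

```python
def single_read_mapping_stats(hits_per_read):
    """hits_per_read: number of reference positions each read aligned to
    (0 = unmapped)."""
    total = len(hits_per_read)
    unmapped = sum(1 for h in hits_per_read if h == 0)
    unique = sum(1 for h in hits_per_read if h == 1)
    multi = sum(1 for h in hits_per_read if h > 1)
    return {"total": total, "unmapped": unmapped, "mapped": total - unmapped,
            "unique": unique, "multi_hit": multi}

print(single_read_mapping_stats([0, 1, 1, 3]))
# {'total': 4, 'unmapped': 1, 'mapped': 3, 'unique': 2, 'multi_hit': 1}
```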

In case you analyse paired-end reads data, you will see the following statistics:

  1. Total number of mate pairs: how many paired-end reads were used in mapping to the reference genome;
  2. Mapped mate pairs: total number of paired reads where both mates were mapped;
  3. Partially mapped mate pairs: total number of paired reads where only one mate in the pair was mapped;
  4. Unmapped mate pairs: total number of paired reads which failed to map to the reference genome;
  5. Improperly mapped mate pairs: total number of paired reads where one of the mates was mapped with an unexpected orientation;
  6. Properly mapped mate pairs: total number of paired reads where both mates were mapped with the expected orientation.

Coverage by chromosome plot is reported for both read types.

../_images/coverage_by_chromosome.png

This plot shows the percentage of chromosome bases covered by at least x reads. To illustrate, imagine a plot showing coverage for a single chromosome, so there is only one line. If a point on that line has x = 100 reads and y = 10%, then 10% of the chromosome’s bases are covered by at least 100 reads.
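
The curve described above (for each x, the fraction of chromosome bases covered by at least x reads) can be computed from a per-base depth array; a minimal sketch with made-up depths:

```python
def coverage_curve(per_base_depth, xs=(1, 10, 100)):
    """For each threshold x, the fraction of bases covered by at least x reads."""
    n = len(per_base_depth)
    return {x: sum(1 for d in per_base_depth if d >= x) / n for x in xs}

# 10 bases of a toy chromosome with varying read depth
depths = [0, 2, 5, 12, 120, 150, 7, 3, 1, 0]
print(coverage_curve(depths))  # {1: 0.8, 10: 0.3, 100: 0.2}
```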

The amount of coverage you are expecting varies with the experimental techniques you are using. Normally you want similar coverage patterns across all chromosomes, but this may not be the case if e.g. you are dealing with advanced stage cancer.

Insert Size statistics will be calculated for paired-end reads only.

Note

What is the difference between fragment size, insert size and mate inner distance?

Mate inner distance is the length between the two sequence reads. Insert size is normally the distance between paired-end adaptors (paired-end reads + mate inner distance). Fragment size is the insert plus both adaptors.
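
The relationship between the three quantities can be written down directly (the read and adaptor lengths below are made-up numbers):

```python
def insert_size(read1_len, read2_len, mate_inner_distance):
    # insert = the two sequenced mates plus the unsequenced stretch between them
    return read1_len + read2_len + mate_inner_distance

def fragment_size(insert, adaptor1_len, adaptor2_len):
    # fragment = insert plus the adaptors ligated on both ends
    return insert + adaptor1_len + adaptor2_len

ins = insert_size(100, 100, 50)   # 250 bp insert
print(fragment_size(ins, 60, 60)) # 370 bp fragment
```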

../_images/insert.jpg

Insert size statistics are useful to validate library construction and include:

  1. Median insert size - the middle value of a sorted list of insert sizes;
  2. Median absolute deviation is calculated by taking the median of the absolute deviations from the median insert size;
  3. Mean insert size (trimmed) - an average of the insert sizes;
  4. Standard deviation of insert size measures the variation in insert sizes from the mean insert size.
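
The four statistics can be computed with Python's standard library (a sketch; Picard trims outliers before computing the mean, which is omitted here):

```python
import statistics

def insert_size_stats(sizes):
    med = statistics.median(sizes)
    mad = statistics.median([abs(s - med) for s in sizes])
    return {"median": med,
            "median_absolute_deviation": mad,
            "mean": statistics.mean(sizes),
            "stdev": statistics.stdev(sizes)}

sizes = [190, 200, 210, 400]  # made-up insert sizes with one outlier
stats = insert_size_stats(sizes)
print(stats["median"], stats["median_absolute_deviation"])  # 205.0 10.0
```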

Insert size distribution graph is displayed for paired-end reads:

../_images/mapped_reads_qc_report_insert_size_distribution.png

This graph shows the distribution of insert sizes.

Of course, the expected proportions of these metrics vary depending on the type of library preparation used, owing to technical differences between paired-end libraries and mate-pair libraries.

Mapped Reads QC Report application is based on the BEDtools and the Picard tools.

You can analyse the output for several Mapped Reads QC reports at once using our Multiple QC Report application.

../_images/multiple_qc_report_mapped_reads_qc.png

This is helpful because it allows you to compare how many reads in your dataset are unmapped, partially mapped or improperly mapped.

Targeted sequencing QC report

This application is good to use when analysing Targeted Sequencing data, e.g. Whole Exome Sequencing assays.

Action: to assess whether the target capture has been successful, i.e. if most of the reads actually fell on the target, if the targeted bases reached sufficient coverage, etc.

Options:

  1. Compute enrichment statistics based on option. The application allows you to compute enrichment statistics for reads mapped only on exome, only on target file, or both exome and target file. (default: Exome)

The following enrichment statistics are computed:

  • Number and proportion of mapped reads on target;
  • Mean coverage on target with at least 2X coverage;
  • Target bases with at least 2X, 10X, 20X, 30X, 40X, and 50X coverage.
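
The enrichment statistics have the following shape; a toy sketch (hypothetical inputs, not the actual Picard/BEDtools computation):

```python
def enrichment_stats(total_mapped, on_target, target_depths,
                     thresholds=(2, 10, 20, 30, 40, 50)):
    """on_target: reads falling on the target regions;
    target_depths: per-base read depth over the target regions."""
    n = len(target_depths)
    covered = {f"{t}X": sum(1 for d in target_depths if d >= t) / n
               for t in thresholds}
    return {"on_target_fraction": on_target / total_mapped,
            "target_bases_covered": covered}

depths = [1, 3, 15, 25, 35, 45, 55, 60, 5, 0]
print(enrichment_stats(1000, 800, depths)["on_target_fraction"])  # 0.8
```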

You can generate these reports directly by choosing Mapped Reads datasets, right clicking on them and selecting the appropriate application (in “Explore” section) or using the Targeted Sequencing Quality Control public data flow.

You can analyse the output for multiple reports at once using the Multiple QC Report application.

../_images/targeted_sequencing_qc_multiple.png

This application is based on the BEDtools, Picard and SAMtools.

Apart from quality control applications, Genestack offers several applications for preprocessing mapped reads.

Mark duplicated mapped reads

Duplicated reads are reads of identical sequence composition and length, mapped to the same genomic position. Marking duplicated reads can help speed up processing for specific applications, e.g. the variant calling step, where processing additional identical reads arising from early PCR amplification effects (jackpotting) would contribute noise to the signal. You can read more about duplicated mapped reads in this excellent SeqAnswers thread.

Action: to go through all reads in a mapped reads file, marking as “duplicates” paired or single reads where the orientation and the 5’ mapping coordinate are the same.

3’ coordinates are not considered due to two reasons:

  1. The quality of bases generated by sequencers tends to drop down toward the 3’ end of a read. Thus its alignment is less reliable compared to the 5’ bases.
  2. If reads are trimmed at 3’ low-quality bases before alignment, they will have different read lengths resulting in different 3’ mapping coordinates.
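
The grouping key described above (orientation plus 5' coordinate) can be sketched like this. This is a hypothetical data model; Picard MarkDuplicates works on BAM records and chooses the copy to keep by summed base qualities:

```python
from collections import defaultdict

def mark_duplicates(reads):
    """reads: dicts with 'chrom', 'strand', 'pos5' and 'qual_sum'. All but the
    highest-quality read sharing an (orientation, 5' coordinate) key are
    marked as duplicates."""
    groups = defaultdict(list)
    for r in reads:
        groups[(r["chrom"], r["strand"], r["pos5"])].append(r)
    for group in groups.values():
        group.sort(key=lambda r: r["qual_sum"], reverse=True)
        group[0]["duplicate"] = False
        for r in group[1:]:
            r["duplicate"] = True
    return reads

reads = [{"chrom": "chr1", "strand": "+", "pos5": 100, "qual_sum": 350},
         {"chrom": "chr1", "strand": "+", "pos5": 100, "qual_sum": 300},
         {"chrom": "chr1", "strand": "-", "pos5": 250, "qual_sum": 320}]
print([r["duplicate"] for r in mark_duplicates(reads)])  # [False, True, False]
```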

The application also takes interchromosomal read pairs into account. When the distance between two mapped mates differs from the internally estimated fragment length, including mates mapping to different chromosomes, the application will not identify or use such reads, but it will not fail due to the inability to find the mate pair.


This tool is based on MarkDuplicates, part of the Picard toolkit.

Remove duplicated mapped reads

The point of removing duplicated mapped reads is to try to limit the influence of early PCR selection (jackpotting). Whether or not you should remove duplicate mapped reads depends on the type of data you have. If you are dealing with whole-genome sequencing data where expected coverage is low and sequences are expected to be present in similar amounts, removing duplicated reads will reduce processing time and have little deleterious effect on analysis. If however you are processing RNA-seq data, where the fold-variation in expression can be up to 10^7, reads are relatively short, and your main point of interest is the variation in expression levels, this probably is not the tool for you. You can read more about duplicated mapped reads in this excellent SeqAnswers thread.

Action: to go through all reads in a Mapped Reads file, marking as “duplicates” paired or single reads where the orientation and the 5’ mapping coordinate are the same and discarding all except the “best” copy.

3’ coordinates are not considered due to two reasons:

  1. The quality of bases generated by sequencers tends to drop down toward the 3’ end of a read. Thus its alignment is less reliable compared to the 5’ bases.
  2. If reads are trimmed at 3’ low-quality bases before alignment, they will have different read lengths resulting in different 3’ mapping coordinates.

The application also takes into account interchromosomal read pairs. When the distance between two mapped mates differs from the internally estimated fragment length, including mates mapping to different chromosomes, the application cannot identify such reads as duplicates, but it will not fail due to the inability to find the mate pair.

This application is based on MarkDuplicates, part of the Picard toolkit.

Subsample reads

You can use this application, for example, if you want to take a look at what your final experimental results will look like, but do not want to spend time processing all your data right away.

Action: to create a random subset of mapped reads.

Command line options

  1. The Subsampling ratio (percentage) option is used to set a fraction of mapped reads you would like to extract (default: 50).
  2. The Random seed option will let you produce different subsets with the same number of mapped reads. (default: 0)

Using the same random seed and the same subsampling ratio will result in identical subsets.

This application is based on the SAMtools.

Merge mapped reads

The application is useful when you have multiple replicates of the same experiment and want to combine them before producing your final result.

Action: to merge multiple Mapped Reads files, producing one single output Mapped Reads file.

The application is based on the SAMtools.

Convert to unaligned reads

This application is very useful when you are interested in the fraction of reads that map exactly to the genome, or when you would like to remap the reads with a different aligner.

Action: to convert a Mapped reads file to raw reads.

This application is based on the Picard toolkit.

Variants preprocessing

While analysing variants, you can also preprocess them. Just select a Genetic Variations file and open the “Preprocess” section to see which applications are available.

Merge variants

Merging variants can be useful when you have, for example, one Genetic Variations file for SNPs and another for indels. After merging, the resulting Genetic Variations file will contain both the SNP and the indel records.

Action: to merge two or more Genetic Variations files into a single file.

Make sure that the same reference genome is specified in the metainfo of the selected Genetic Variants files.

This application is based on the BCFtools.

Concatenate variants

Concatenation is appropriate if you, for example, have separate Genetic Variations files for each chromosome and simply want to join them “end-to-end” into a single Genetic Variations file.

Action: to join two or more Genetic Variations files by concatenating them into a larger, single file.

Make sure that the same reference genome is specified in the metainfo of Genetic Variants files you wish to concatenate.

The application always allows overlaps, i.e. the first position at the start of the second input file may come before the last position of the first input file.

Command line options:

  1. The Remove duplicated variants option checks for the duplicated variants and makes sure that there are no redundant results. (default: unchecked)

The application is based on the BCFtools.

RNA-seq data analysis

Mapping (also called alignment) refers to the process of aligning sequencing reads to a reference sequence, whether the reference is a complete genome, transcriptome, or de novo assembly.

Note

What is the difference between genome, exome and transcriptome?

Genome includes both coding (genes) and noncoding DNA in a given cell type.

Exome is the part of the genome formed by exons, i.e. it includes all DNA that is transcribed into mRNA.

Transcriptome is the collection of all mRNAs present in a given cell type. In comparison to the genome, the transcriptome is dynamic in time (within the same cell type) in response to both internal and external stimuli. Thus, the transcriptome derived from any one cell type will not represent the entire exome, i.e. all cells may have essentially the same genome/exome, but not all genes are expressed in a specific cell type.

There are at least two types of mapping strategies: spliced mapping and unspliced mapping. In the case of RNA-seq data, reads are derived from mature mRNA, so there are typically no introns in the sequence. For example, a read may span two exons, whereas in the reference genome the first exon is followed by an intron.

Note

What is the difference between exons and introns?

Exons and introns are both parts of genes. However, exons code for proteins, whereas introns do not. In RNA splicing, introns are removed and exons are joined to produce mature messenger RNA (mRNA), which is then used to synthesize proteins.

In this case, if you use an unspliced mapper, it will find a matching sequence in only one of the exons, while the rest of the read will not match the intron in the reference, so the read cannot be properly aligned. When analysing RNA-seq data with an unspliced aligner, reads may be mapped to potentially novel exons, but reads spanning splice junctions are likely to remain unmapped.

In contrast, spliced mappers know not to align RNA-seq reads to introns: they identify possible downstream exons and try to align to those instead, ignoring introns altogether. Taking this into account, we recommend using spliced mapping applications to analyse RNA-seq data.
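
The problem with unspliced alignment of junction-spanning reads can be shown with a toy example (made-up sequences):

```python
exon1, intron, exon2 = "AAACCC", "GTTTTG", "GGGTTT"
genome = exon1 + intron + exon2   # what an unspliced mapper searches
mrna = exon1 + exon2              # introns are spliced out of mature mRNA

junction_read = mrna[3:9]         # "CCCGGG": spans the exon1/exon2 junction
print(junction_read in mrna)      # True: the read comes from the transcript
print(junction_read in genome)    # False: no contiguous genomic match exists
```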

On Genestack, you will find two spliced aligners - “Spliced mapping with Tophat2” and “Spliced mapping to transcriptome with STAR”.

Spliced mapping with Tophat2

Action: to map raw reads with transcriptomic data like RNA-seq to a reference genome, taking or not taking into account splice junctions.

Note

What is a splice junction?

Splice junctions are exon-intron boundaries at which RNA splicing takes place. For example, to cut out an intron (between two exons) you need to splice in two places so that the two exons can be joined.

Details on various settings:

  1. Strand-specificity protocol. If you are using strand-specific RNA-seq data, this option will let you choose between the “dUTP” and “ligation” method. If you are not sure whether your RNA-seq data is strand-specific or not, you can try using Subsample Reads application to make a small subsample, map it with Spliced Mapping with Tophat2 and check the coverage in Genome Browser for genes on both strands. (default: None)
  2. Rule for mapping over known annotations. This option allows you to use annotated transcripts from the reference genome to distinguish between novel and known junctions (“Yes, and discover novel splice junctions”). Also, you can restrict mappings only across known junctions (“Yes, without novel splice junctions discovery”) or infer splice junctions without any reference annotation (“Do not use known annotations”). (default: “Yes, and discover novel splice junctions”)
  3. Rule for filtering multiple mappings. If you set “Unique mappings only”, the application will report only unique hits for one mappable read. If you are interested in reads mapped to multiple positions in the genome, choose “Multiple mappings only”. Select “None”, if you would like to get both unique and multiple mappings. (default: None)
  4. The Number of best mappings to report option lets you increase the number of reported mappings. This can be used together with the “Rule for filtering multiple mappings” option to choose whether to keep reads mapping uniquely or to multiple positions, e.g. to report up to 5 possible mappings, and only for multi-hit reads. (default: 1)
  5. The Number of allowed mismatches option lets you set the maximum number of allowed mismatches per read. (default: 2)
  6. The Disallow unique mappings of one mate option allows you to discard pairs of reads where one mate maps uniquely and the other to multiple positions. (default: unchecked)
  7. The Disallow discordant mappings option will discard all mappings where the two mates map uniquely but with unexpected orientation, or where the distance between the two mapped mates differs from an internally estimated fragment length, including mates mapping to different chromosomes. (default: unchecked)

The application is based on the Tophat2 aligner and used in the Testing Differential Gene Expression tutorial.

Spliced mapping to transcriptome with STAR

Action: to perform gapped read alignment of transcriptomic data to a Reference Genome taking into account splice junctions.

In comparison to Tophat2, STAR is fast while remaining very accurate and precise. Moreover, in contrast to all our other mappers, it maps reads onto the reference transcriptome, not the genome. Another advantage of the application is that it can be used to analyse both short and long reads, making it compatible with various sequencing platforms. What’s more, this Spliced Mapper supports a two-pass alignment strategy, in which a second alignment pass aligns reads across the splice junctions found in the first pass, improving quantification of novel splice junctions. Taking all these features into account, the “Spliced Mapping to Transcriptome with STAR” application can be a very good alternative to other RNA-seq aligners.

Now, let’s look through the application parameters:

  1. The Enable two pass mapping mode option is recommended for sensitive novel junction discovery. The idea is to collect the junctions found in the first pass, and use them as “annotated” junctions for the second mapping pass. (default: unchecked)
  2. Maximum number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped. This option allows you to set how many mappings you expect for one mappable read if it is mapped to multiple positions of the genome. (default: 10)
  3. The Minimum overhang for unannotated junctions option prohibits alignments with very small splice overhangs for unannotated junctions (the overhang is the part of the read that extends across the splice junction into the adjacent exon). (default: 5)
  4. The Minimum overhang for annotated junctions option does the same job as “Minimum overhang for unannotated junctions” but for annotated junctions. (default: 3)
  5. The Maximum number of mismatches per pair parameter sets how many mismatches you allow per pair. (default: 10)
  6. Minimum intron length is a minimum intron size for the spliced alignments. Read the paper in case you are not sure about the value. (default: 21)
  7. Maximum intron length is the maximum intron size you consider for the spliced alignments. For example, set 1,000 and the application will take into account introns of at most 1,000 bp in size. Note that the default value of 0 means a maximum intron size of about 590,000 bp. If you are not sure about the intron size value, the paper may help you to make a decision. (default: 0)
  8. Maximum genomic distance between mates is the maximum gap between reads from a pair when mapped to the genome. If the reads map farther apart, the fragment is considered to be chimeric. (default: 0)

The application is based on the STAR aligner.

Gene quantification with RSEM

Action: to use STAR mapper to align reads against reference transcripts and apply the Expectation-Maximization algorithm to estimate gene and isoform expression levels from RNA-seq data.

Command line options:

  1. The RNA-seq protocol used to generate the reads is strand specific. If the reads are strand-specific, check this option. (default: unchecked)
  2. Estimated average fragment length (for single-end reads only) option. It is important to know the fragment length distribution to accurately estimate expression levels for single-end data. Typical Illumina libraries produce fragment lengths ranging from 180 to 200 bp. For paired-end reads, the average fragment length can be directly estimated from the reads. (default: 190)
  3. Estimated standard deviation of fragment length (for single-end reads only) option. If you do not know the standard deviation of the fragment library, you can probably assume that the standard deviation is 10% of the average fragment length. For paired-end reads, this value will be estimated from the input data. (default: 20)

When the task is complete, click View report in the Explore section to get gene- and isoform-level expression estimates.

../_images/rsem_output_report.png

The output report represents a table with the following main columns:

  • transcript_id - name of the transcript;
  • gene_id — name of the gene which the transcript belongs to. If no gene information is provided, gene_id and transcript_id are the same;
  • length — transcript’s sequence length (poly(A) tail is not counted);
  • effective_length — counts only the positions that can generate a valid fragment. If no poly(A) tail is added, the effective length is equal to the transcript length minus the mean fragment length plus 1. If a transcript’s effective length is less than 1, both its effective length and abundance estimates are set to 0;
  • expected_count — the sum, over all reads, of the posterior probability that the read comes from this transcript;
  • TPM — transcripts per million normalized by total transcript count in addition to average transcript length;
  • FPKM — fragments per kilobase of exon per million fragments mapped;
  • IsoPct — the percentage of the transcript’s abundance over its parent gene’s abundance. If the parent gene has only one isoform or the gene information is not provided, this field will be set to 100.
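As an illustration of how these columns relate, here is a small Python sketch. The transcript names, lengths and counts are invented, and RSEM’s actual estimation procedure is far more involved; this only reproduces the arithmetic linking the columns:

```python
# Toy computation of effective_length, TPM, FPKM and IsoPct from counts.

def effective_length(length, mean_frag_len):
    # effective length = transcript length - mean fragment length + 1; 0 if < 1
    eff = length - mean_frag_len + 1
    return eff if eff >= 1 else 0

transcripts = {
    # transcript_id: (gene_id, length, expected_count) -- made-up values
    "tx1": ("geneA", 1500, 300.0),
    "tx2": ("geneA", 1000, 100.0),
    "tx3": ("geneB", 2000, 600.0),
}
MEAN_FRAG_LEN = 190

eff = {t: effective_length(v[1], MEAN_FRAG_LEN) for t, v in transcripts.items()}
rates = {t: transcripts[t][2] / eff[t] for t in transcripts}   # reads per base
total_rate = sum(rates.values())
total_count = sum(v[2] for v in transcripts.values())

# TPM normalises by effective length first, then scales to one million.
tpm = {t: rates[t] / total_rate * 1e6 for t in transcripts}
# FPKM: fragments per kilobase of transcript per million mapped fragments.
fpkm = {t: transcripts[t][2] * 1e9 / (eff[t] * total_count) for t in transcripts}

# IsoPct: transcript abundance as a percentage of its parent gene's abundance.
gene_tpm = {}
for t, (g, _, _) in transcripts.items():
    gene_tpm[g] = gene_tpm.get(g, 0.0) + tpm[t]
iso_pct = {t: 100 * tpm[t] / gene_tpm[transcripts[t][0]] for t in transcripts}

for t in transcripts:
    print(t, round(tpm[t], 1), round(fpkm[t], 1), round(iso_pct[t], 1))
```

Note that TPM values always sum to one million across the table, and a single-isoform gene (tx3 here) gets IsoPct 100.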

The application is based on the RSEM program and the STAR mapper.

Quantify raw coverage in genes

Action: to compute gene counts from mapped reads. The application takes as input a mapped reads file, and uses a reference genome to produce a mapped reads counts file, indicating how many reads overlap each gene specified in the genome’s annotation.

Let’s go through the application parameters:

  1. Feature type option. Depending on your tasks, you should specify the feature type for which overlaps are counted, choosing from “exon”, “CDS” (coding DNA sequence), “3’UTR” (the 3’ untranslated region) or “5’UTR” (the 5’ untranslated region). For example, you may consider each exon as a feature in order to check for alternative splicing. By default, the “gene-id” will be used as a feature identifier. (default: exon)

  2. The Rule for overlaps option dictates how mapped reads that overlap genomic features will be treated. There are three overlap resolution modes: union, strict-intersection, and non-empty intersection. (default: union)

    The first one, “union”, is the recommended mode: a read (or read pair) is assigned to a feature if it overlaps that feature at least partly. The “strict-intersection” mode requires the read to be fully contained within the feature, while “non-empty intersection” counts reads that fully or partly intersect the feature, ignoring positions not covered by any feature. In every mode, a read is counted for a feature only if it resolves to precisely one feature; if it remains ambiguous between several features, it is not counted.

../_images/overlap_resolution_modes.png
  3. Strand-specific reads. The application takes the direction of the read and the reference into account, so that a read from the wrong direction, even if it is mapped to the right place, will not be counted. This option is useful if your data is strand-specific and you want reads overlapping a feature to be counted depending on whether they map to the same or the opposite strand as the feature. Choose “Yes” if the reads were mapped to the same strand as the feature, “Reverse” if the reads were mapped to the opposite strand, and “No” if you do not consider strand-specificity. (default: Yes)

This application is based on the HTSeq tool and is used in the Differential Gene Expression Analysis pipeline. After calculating read abundance at the gene level, you will be able to run the Test Differential Gene Expression application.
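The three overlap resolution modes can be sketched in a few lines of Python. This is a simplified illustration of the HTSeq semantics, not the tool itself, and the feature intervals and read positions are toy data:

```python
# Simplified sketch of HTSeq-style overlap resolution modes.

def resolve(read_positions, features, mode):
    """Return the feature the read is counted for, or None (ambiguous/empty).

    features -- dict of feature name -> (start, end), end exclusive
    """
    # For every base of the read, collect the set of overlapping features.
    sets = []
    for pos in read_positions:
        sets.append({f for f, (start, end) in features.items()
                     if start <= pos < end})

    if mode == "union":
        result = set().union(*sets)
    elif mode == "intersection-strict":
        result = set.intersection(*sets)   # an uncovered base empties the set
    elif mode == "intersection-nonempty":
        nonempty = [s for s in sets if s]  # ignore uncovered bases
        result = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(mode)
    # The read is counted only if exactly one feature remains.
    return result.pop() if len(result) == 1 else None

features = {"geneA": (100, 200), "geneB": (180, 300)}
read = list(range(150, 190))    # overlaps geneA fully and geneB partly

print(resolve(read, features, "union"))                  # None (ambiguous)
print(resolve(read, features, "intersection-strict"))    # geneA
```

A read hanging off the edge of a feature behaves differently per mode: “union” and “non-empty intersection” still count it, while “strict-intersection” discards it.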

Isoform quantification with Salmon

Specific genes can produce a range of different transcripts encoding various isoforms, i.e. proteins of varying lengths containing different segments of the basic gene sequence. Such isoforms can be generated, for example, in the process of alternative splicing.

Action: to quantify abundances of transcripts from RNA-seq data. The application requires a set of reference transcripts and uses the concept of quasi-mapping to provide accurate expression estimates very quickly.

The application is based on the Salmon tool.

Isoform quantification with Kallisto

Specific genes can produce a range of different transcripts encoding various isoforms, i.e. proteins of varying lengths containing different segments of the basic gene sequence. Such isoforms can be generated, for example, in the process of alternative splicing.

Action: to quantify abundances of genes and isoforms from RNA-seq data without the need for alignment. It uses the Expectation-Maximization algorithm on “pseudoalignments” to find the set of potential transcripts a read could have originated from. Note that the application accepts a reference transcriptome (cDNA), not a genome (DNA).

Let’s inspect the application options:

  1. The Strand-specificity protocol parameter is used to specify how to process the pseudoalignments. If “None”, the application does not take strand specificity into account. To run the application in strand-specific mode, change this value to “Forward” if you are interested only in fragments where the first read in the pair is pseudomapped to the forward strand of a transcript. If a fragment is pseudomapped to multiple transcripts, only the transcripts that are consistent with the first read are kept. “Reverse” is the same as “Forward”, except that the first read is pseudomapped to the reverse strand of the transcript. (default: None)
  2. The Enable sequence based bias correction option will correct the transcript abundances according to a model of sequence-specific bias. (default: checked)
  3. The Estimated average fragment length (for single-end reads only) option must be specified in the case of single-end reads. Typical Illumina libraries produce fragment lengths ranging from 180 to 200 bp. For paired-end reads, the average fragment length can be directly estimated from the reads. (default: 190)
  4. Estimated standard deviation of fragment length (for single-end reads only) option. If you do not know the standard deviation of the fragment library, you can probably assume that the standard deviation is 10% of the average fragment length. For paired-end reads, this value will be estimated from the input data. (default: 20)

Use the View report application in the Explore section to review the Kallisto output report.

../_images/kallisto_report.png

It contains a table with the following main columns:

  • target_id — feature name, e.g. for transcript, gene;
  • length — feature length;
  • eff_length — effective feature length, i.e. a scaling of feature length by the fragment length distribution;
  • est_counts — estimated feature counts;
  • tpm — transcripts per million normalized by total transcript count in addition to average transcript length.

The application is based on the Kallisto tool.

Quantify FPKM coverage in isoforms

Specific genes can produce a range of different transcripts encoding various isoforms, i.e. proteins of varying lengths containing different segments of the basic gene sequence. Such isoforms can be generated, for example, in the process of alternative splicing.

Action: to quantify read abundance at the isoform level. It accepts mapped reads (corresponding to isoform alignment) and a reference genome as inputs. The output is a file containing isoform counts. Several such files, corresponding to samples with different biological conditions and isoforms, can be further used in the Test Differential Isoform Expression application.

Before running the application, you can choose the following parameters:

  1. The Strand-specificity protocol option specifies the protocol used for generating your reads. If “None”, the application will consider your data non-strand-specific; this value can be changed to “dUTP” or “RNA-ligation”. (default: None)
  2. The No correction by effective length option is used if you would like not to apply effective length normalization to transcript FPKM (fragments per kilobase of exon per million mapped reads). (default: unchecked)

The application always makes an initial estimation procedure to more accurately weight reads mapping to multiple places in the genome.

This application is based on the cuffquant (a part of the Cufflinks tool) and used in Differential Isoform Expression Analysis public data flow.

Test differential gene expression

Action: to perform differential gene expression analysis between groups of samples. The application accepts Mapped Read Counts (from the “Quantify Raw Coverage in Genes” application) and generates Differential Expression Statistics file which you can view with the Expression Navigator application.

Options:

  1. The “Group samples by” option allows you to apply autogrouping, i.e. the application helps you group your samples according to an experimental factor indicated in the samples’ metainfo (e.g. disease, tissue, sex, cell type, cell line, treatment, etc.). (default: None)
  2. Method for differential expression. The application supports two methods, based on the “DESeq2” and “edgeR” statistical R packages, which perform normalization across libraries, fit a negative binomial distribution and carry out a likelihood ratio test (LRT) using a generalized linear model (GLM). (default: DESeq2)

With edgeR, one of the following types of dispersion estimate is used, in order of priority and depending on the availability of biological replicates: Tagwise, Trended, or Common. edgeR is also much faster than DESeq2 at fitting the GLM, but it takes slightly longer to estimate the dispersion. Importantly, edgeR gives moderated fold changes for the extremely lowly expressed differentially expressed (DE) genes which DESeq2 discards, reflecting that the likelihood of a gene being significantly differentially expressed is related to how strongly it is expressed. So, choose one of the packages according to your needs and run the analysis.

For each group, a GLM LRT is carried out to find DE genes in this group compared to the average of the other groups. In the case of 2 groups, this reduces to the standard analysis of finding genes that are differentially expressed between 2 groups. Thus, for N groups, the application produces N tables of Top DE genes. Each table shows the corresponding Log2(Fold Change), Log2(Counts per Million), p-value, and False Discovery Rate for each gene. Look at all result tables and plots in Expression Navigator application.

  • log-fold change: the fold-change in expression of a gene between two groups A and B is the average expression of the gene in group A divided by the average expression of the gene in group B. The log-fold change is obtained by taking the logarithm of the fold change in base 2.
  • log-counts per million: dividing each read count by the total read counts in the sample, and multiplying by 10^6 gives counts per million (CPM). log-counts per million are obtained by taking the logarithm of this value in base 2.
  • p-value. The application also computes a p-value for each gene. A low p-value (typically, < 0.005) is viewed as evidence that the null hypothesis can be rejected (i.e. the gene is differentially expressed). However, because multiple tests are performed, the value that should be looked at to safely assess significance is the false discovery rate.
  • False discovery rate. The FDR is a corrected version of the p-value, which accounts for multiple testing correction. Typically, an FDR < 0.05 is good evidence that the gene is differentially expressed. You can read more about it here.
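For intuition, the statistics above can be computed on toy numbers in a few lines of Python. This is a sketch only: the counts and p-values are invented, and DESeq2 and edgeR use considerably more sophisticated models, but the Benjamini-Hochberg adjustment shown is the standard FDR procedure:

```python
import math

# Toy computation of log2 fold change, CPM and Benjamini-Hochberg FDR.

def log2_fold_change(mean_a, mean_b):
    # fold change = average expression in group A / average expression in group B
    return math.log2(mean_a / mean_b)

def cpm(count, total_counts):
    # counts per million: read count / library size * 10^6
    return count / total_counts * 1e6

def bh_fdr(pvalues):
    """Benjamini-Hochberg adjusted p-values (false discovery rate)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    prev = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = n - rank_from_end            # 1-based rank of pvalues[i]
        prev = min(prev, pvalues[i] * n / rank)
        adjusted[i] = prev
    return adjusted

print(log2_fold_change(800.0, 200.0))                    # 2.0 (up-regulated in A)
print(round(math.log2(cpm(500, 20_000_000)), 3))         # 4.644
print([round(f, 3) for f in bh_fdr([0.01, 0.02, 0.03, 0.5])])
```

The FDR values are always at least as large as the raw p-values, which is why a gene can pass a p-value cut-off yet fail the FDR cut-off.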

This application is based on DESeq2 and edgeR R packages.

Test differential isoform expression

Action: to perform differential isoform expression analysis between groups of samples. The application accepts FPKM Read Counts (from Quantify FPKM Coverage in Isoforms application) and generates Differential Expression Statistics file which you can view in Expression Navigator application.

The application has the following options:

  1. The “Group samples by” option allows you to apply autogrouping, i.e. the application helps you group your samples according to an experimental factor indicated in the samples’ metainfo (e.g. disease, tissue, sex, cell type, cell line, treatment, etc.). (default: None)
  2. Apply fragment bias correction option - if checked, the application will run the bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates. (default: checked)
  3. The Apply multiple reads correction option is useful if you would like to apply the multiple reads correction. (default: checked)

The application finds isoforms that are differentially expressed (DE) between several groups of samples and produces tables of Top DE transcripts. Each table shows the corresponding Log2(Fold Change), Log2(Counts per Million), p-value, and False Discovery Rate for each isoform. Use the Expression Navigator to visualize the results.

  • log-fold change: the fold-change in expression of a gene between two groups A and B is the average expression of the gene in group A divided by the average expression of the gene in group B. The log-fold change is obtained by taking the logarithm of the fold-change in base 2.
  • log-counts per million: dividing each read count by the total read counts in the sample, and multiplying by 10^6 gives counts per million (CPM). log-counts per million are obtained by taking the logarithm of this value in base 2.
  • p-value. The application also computes a p-value for each isoform. A low p-value (typically, < 0.005) is viewed as evidence that the null hypothesis can be rejected (i.e. the isoform is differentially expressed). However, because multiple tests are performed, the value that should be looked at to safely assess significance is the false discovery rate.
  • False discovery rate. The FDR is a corrected version of the p-value, which accounts for multiple testing correction. Typically, an FDR < 0.05 is good evidence that the isoform is differentially expressed. You can read more about it here.

This application is based on the cuffdiff which is a part of the Cufflinks tool.

Expression navigator

Action: to view and filter the results of differential gene and isoform expression analyses.

../_images/expression_navigator_for_RNA-seq.png

The Expression Navigator page contains four sections:

  1. Groups Information section. It is a summary of the groups available for comparison. Size refers to the number of samples used to generate each group.
../_images/expression_navigator_group_information.png
  2. The Top Differentially Expressed Genes section allows you to choose which groups to compare and how to filter and sort identified differentially expressed (DE) genes.
../_images/expression_navigator_top_de_genes.png

You can filter DE genes by maximum acceptable false discovery rate (FDR), up or down regulation, minimum log fold change (LogFC), and minimum log counts per million (LogCPM).

../_images/expression_navigator_de_genes_filtering.png

Let’s look through these statistics:

  • log-fold change: the fold-change in expression of a gene between two groups A and B is the average expression of the gene in group A divided by the average expression of the gene in group B. The log-fold change is obtained by taking the logarithm of the fold change in base 2. Log-transformed values contain the same information as the fold change, but are easier to interpret because up- and down-regulation are symmetric around zero. Genes with positive log FC are considered to be up-regulated in the selected group, ones with negative log FC are down-regulated.
  • log-counts per million: dividing each read count by the total read counts in the sample, and multiplying by 10^6 gives counts per million (CPM). Log-counts per million are obtained by taking the logarithm of this value in base 2.
  • p-value. The application also computes a p-value for each gene. A low p-value (typically, < 0.005) is viewed as evidence that the null hypothesis can be rejected (i.e. the gene is differentially expressed). However, because multiple tests are performed, the value that should be looked at to safely assess significance is the false discovery rate.
  • False discovery rate. The FDR is a corrected version of the p-value, which accounts for multiple testing correction. Typically, an FDR < 0.05 is good evidence that the gene is differentially expressed. You can read more about it here.

Moreover, you can sort the DE genes by these statistics by clicking the arrows next to the metric names in the table headers.

../_images/expression_navigator_de_genes_sorting.png

The buttons at the bottom of the section allow you to update the list based on your filtering criteria or clear your selection.

  3. The top-right section contains a boxplot of expression levels. Each colour corresponds to a gene. Each boxplot corresponds to the distribution of a gene’s expression levels in a group, and coloured circles represent the expression value of a specific gene in a specific sample.
../_images/expression_navigator_de_boxplots.png
  4. The bottom-right section contains a search box that allows you to look for specific genes of interest. You can look up genes by gene symbol, with autocomplete. You can search for any gene (not only those that are visible with the current filters).
../_images/expression_navigator_de_search_box.png

You can read more about this application in the corresponding tutorials.

Single-cell RNA-seq analysis

Action: to identify heterogeneously-expressed (HE) genes across cells, while accounting for technical noise. The application analyses single-cell RNA-seq data and accepts several Mapped Read Counts as inputs. The output report can be opened in Single-cell RNA-seq Visualiser.

The application supports two algorithms for heterogeneity analysis. The first uses spike-in data (artificially introduced RNAs of known abundance) to calibrate a noise model. The second method is a non-parametric algorithm based on smoothing splines and does not require the presence of spike-in data.

To identify highly variable genes you can try different options:

  1. The Use spike-ins to calibrate noise option determines whether or not spike-in data should be taken into account. If you select only one folder before running the application, the spike-free algorithm will be used and this option will be switched off by default. But if you select two folders, one for biological and the other for spike-in data, you can use the Brennecke algorithm, which requires this option.
  2. The Exclude samples with low coverage option allows you to exclude or include for analysis samples with low read counts. (default: checked)
  3. Significance level for the p-value (-10log₁₀(p)). If you set it equal to 1, the application will select the genes for which the p-value is smaller than 0.1. (default: 1)

The next three options are available if spike-ins are included in the experiment and the “Use spike-ins to calibrate noise” option is switched on:

  1. The Expected biological CV is the minimum threshold chosen for quantifying the level of biological variability (CV — coefficient of variation) expected in the null hypothesis of the model. (default: 0.5)
  2. The Noise fit - proportion of genes with high CV² to remove option allows you to exclude spike-in genes with high CV² to fit the noise model. (default: 0)
  3. The Noise fit - proportion of genes with low mean expression to remove option enables you to exclude a fraction of spike-in genes with low mean expression to fit the noise model, because extreme outliers tend to skew the fit. (default: 0.85)

To look at the HE analysis results, open the created Single-cell RNA-seq Analysis page in Single-cell RNA-seq Visualiser.

This application is based on such R packages as DESeq, statmod, ape, flashClust and RJSONIO.

Read more about single-cell RNA-seq analysis on Genestack.

Single-cell RNA-seq visualiser

Action: to explore cell-to-cell variability in gene expression in even seemingly homogeneous cell populations based on scRNA-seq datasets.

The application shows basic statistics such as the number of identified highly variable genes across the analysed samples.

../_images/sc-rna-seq_basic_statistics.png

It also provides several quality control (QC) plots allowing you to check the quality of raw sequencing data, estimate and fit technical noise for the Brennecke algorithm, and detect genes with significantly high variability in expression.

../_images/qc_plots_in_single_cell_visualizer.png

QC plots are adopted from the original paper by Brennecke et al. In all the plots described below, gene expression levels are normalized using the DESeq normalization procedure.

The first plot describing the quality of raw data is the Scatter Plot of Normalised Read Counts, which shows the cell-to-cell correlation of normalized gene expression levels. Each dot represents a gene, its x-coordinate is the normalized gene count in the first cell, and its y-coordinate is the normalized gene count in the second cell. If spike-ins were used during the analysis, separate plots will be rendered for spike-in genes and for sample genes.

../_images/sc-rna-seq_qc_raw.png

The Technical Noise Fit and Highly Variable Genes plots provide a visual summary of the gene expression noise profile in your dataset across all cells.

../_images/sc-rna-seq_technical_noise_fit_and_variable_genes.png

They graph the squared coefficient of variation (CV²) against the average normalized read counts across samples. The Gene Expression Variability QC plot allows you to visualize the genes whose expression significantly varies across cells. A gene is considered as highly variable if its coefficient of biological variation is significantly higher than 50% (CV² > 0.25) and the biological part of its coefficient of variation is significantly higher than a user-defined threshold (its default value is 50%, and can be modified in the Single-cell Analyser). The coefficient of variation is defined as the standard deviation divided by the mean. It is thus a standardized measure of variance.
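The CV² criterion above can be illustrated with a short Python sketch. The counts are invented and the full Brennecke method also fits a technical noise model; this only shows how CV² is computed and compared against the 50% (CV² > 0.25) threshold:

```python
# Toy illustration of flagging highly variable genes by squared CV.

def cv_squared(values):
    """Squared coefficient of variation: variance / mean^2."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)   # sample variance
    return var / mean ** 2

# Invented normalised counts of two genes across six cells:
stable_gene = [100, 105, 95, 102, 98, 100]
variable_gene = [10, 300, 5, 250, 20, 400]

for name, counts in [("stable", stable_gene), ("variable", variable_gene)]:
    flagged = cv_squared(counts) > 0.25        # CV > 50% threshold
    print(name, round(cv_squared(counts), 3), flagged)
```

A gene expressed around the same level in every cell has CV² near zero, while a gene that is silent in some cells and strongly expressed in others easily exceeds the threshold.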

If spike-ins were used to calibrate technical noise, then a separate Technical Noise Fit plot is displayed. On this plot, each dot corresponds to a “technical gene” (spike-in gene): the mean normalized count across all samples is plotted on the x-axis and the squared coefficient of variation (CV²) of the normalized counts across all samples on the y-axis. The plot also represents the fitted noise model as a solid red line (with 95% confidence intervals as dotted red lines). It allows you to check whether the noise model fits the data reasonably well. If that is not the case, you should change the noise fitting parameters in the Single-cell Analysis application.

Expression of the highly variable genes across all cell samples is represented by an interactive clustered heatmap.

../_images/heatmap_single_cell_visualizer.png

The interactive heatmap depicts the log normalised read count of each significant highly variable gene (rows) in each cell sample (columns). Hierarchical clustering of molecular profiles from cell samples is based on the similarity in gene expression of highly expressed genes and allows identification of molecularly distinct cell populations. The heatmap is clustered both by columns and by rows, to identify clusters of samples with similar gene expression profiles, and clusters of potentially co-expressed genes. The bi-clustered heatmap is provided by an open source interactive Javascript library InCHlib (Interactive Cluster Heatmap library).

Finally, several plots in the Samples Visualisation section can be used to detect cell subpopulations and identify novel cell populations based on gene expression heterogeneity in the single-cell transcriptomes.

../_images/clustering_single_cell_visualizer.png

The Samples Visualisation section provides interactive plots used to cluster cell samples based on expression of highly variable genes. Currently, two alternative methods are supported for visualisation and clustering of samples: the first one is based on the t-distributed Stochastic Neighbour Embedding (t-SNE) algorithm and the second one uses Principal Component Analysis (PCA).

For automatic cluster identification, the k-means clustering algorithm can be used in combination with either t-SNE or PCA. K-means clustering requires you to supply the number of clusters to look for (“k”). You can either enter it manually using the dropdown menu or use the suggested value estimated using the “elbow” method (choosing a value of k such that increasing the number of clusters does not significantly reduce the average “spread” within each cluster).
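The elbow heuristic can be sketched as follows. This is a naive one-dimensional k-means on invented data, not the application’s actual implementation, and the 5% improvement threshold is an arbitrary choice for the illustration:

```python
# Toy "elbow" method: run k-means for several k and pick the k after which
# adding clusters stops reducing the within-cluster spread much.

def kmeans_1d(points, k, iterations=20):
    """Naive 1-D Lloyd's algorithm; returns the within-cluster sum of squares."""
    s = sorted(points)
    centers = [s[(i * len(s)) // k] for i in range(k)]   # spread-out initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Move each center to its cluster mean (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - centers[c]) ** 2 for c in range(k)) for p in points)

# Three well-separated groups of values:
data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7, 20.1, 19.9, 20.0]

wcss = {k: kmeans_1d(data, k) for k in range(1, 6)}
# Smallest k whose next step improves the spread by less than 5% of the total:
elbow = min(k for k in range(1, 5) if wcss[k] - wcss[k + 1] < 0.05 * wcss[1])
print(elbow)                                             # 3
```

For these three clear groups the spread drops sharply up to k = 3 and barely changes afterwards, so the heuristic suggests k = 3.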

The Interactive Principal Component Analysis (PCA) scatter plot is rendered using the NVD3 Javascript library. The PCA features and k-means algorithm results are computed using R’s built-in functions prcomp and knn. The t-SNE transformation is computed using the Rtsne package.

Read our blog post about the application and single-cell RNA-seq analysis.

Genome/exome sequencing data analysis

Mapping (also called alignment) refers to the process of aligning sequencing reads to a reference sequence, whether the reference is a complete genome, transcriptome, or de novo assembly.

There are at least two types of mapping strategies — Spliced Mapping and Unspliced Mapping. In contrast to spliced aligners, unspliced read aligners map reads to a reference without allowing large gaps such as those arising from reads spanning exon boundaries, or splice junctions. When analysing whole genome sequencing (WGS) or whole exome sequencing (WES) data, there is no need to look for splice junctions. That is why we recommend using unspliced mapping applications in such cases.

On Genestack, you will find two unspliced aligners — Unspliced Mapping with BWA and Unspliced Mapping with Bowtie2.

Unspliced mapping with BWA

Action: to map WES or WGS data to a reference genome without allowing splice junctions. The application generates Mapped Reads which can be used further with our Variant Calling application, which is based on samtools mpileup.

BWA’s MEM algorithm will be used to map paired-end or single-end reads from 70 bp up to 1 Mbp (“mem” option in the command line). For reads up to 70 bp, the algorithm called BWA-backtrack will be applied. This algorithm is implemented with the “aln” command, which produces the suffix array (SA) coordinates of the input reads. Then the application converts these SA coordinates to chromosome coordinates using the “samse” command (if your reads are single-end) or “sampe” (for paired-end reads).
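The algorithm choice described above can be summarised in a small Python sketch. The thresholds and command names follow the text; the function itself is hypothetical, not part of the application:

```python
# Toy decision logic for the BWA algorithm choice described above.

def bwa_strategy(read_length, paired_end):
    """Pick the BWA algorithm and follow-up command for a given library."""
    if read_length > 70:
        return ("mem", None)            # BWA-MEM covers reads from 70 bp to 1 Mbp
    # Short reads: BWA-backtrack produces SA coordinates with "aln",
    # then converts them to chromosome coordinates with "samse"/"sampe".
    convert = "sampe" if paired_end else "samse"
    return ("aln", convert)

print(bwa_strategy(100, paired_end=True))    # ('mem', None)
print(bwa_strategy(50, paired_end=False))    # ('aln', 'samse')
```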

Command line options:

  1. Perform targeted mapping option. If this parameter is selected, a BED file (“Target region” source file) is used to restrict mapping of the reads to specific locations in the genome. The reference genome is altered to only contain those locations, using the bedtools “getfasta” command, and the reads are then mapped to the altered genome. The resulting SAM file contains local genome coordinates, which are converted back to the global coordinates of the reference genome. (default: unchecked)

The application is based on the BWA aligner and BEDtools. The application is used in Whole Exome Sequencing Data Analysis and Whole Genome Sequencing Data Analysis tutorials.

Unspliced mapping with Bowtie2

Action: to map WES or WGS data to a reference genome without allowing splice junctions. The application generates Mapped Reads which can be used further with our Variant Calling application, which is based on samtools mpileup.

Let’s look at the parameters we can use to do mapping:

  1. Report the best mapping option. The application will consider only the best mapping for one mappable read. (default: checked)
  2. Limit the number of mappings to search option. If you are interested in reads mapping to multiple positions, switch off the “Report the best mapping” option and set the number of mappable positions per read in the “Limit the number of mappings to search” text box. (default: 1)
  3. Rule for filtering mappings. You can apply a rule for filtering mappings to choose whether to keep reads mapping uniquely or to multiple positions. (default: None)
  4. Number of allowed mismatches option. If you want to be stricter, you can change the maximum number of allowed mismatches, e.g. if you set it to 1, any mapping with 2 or more mismatches will not be reported. (default: 0)

For paired-end reads, two more options appear:

  1. The Disallow unique mappings of one mate option allows you to discard pairs of reads where one mate maps uniquely and the other to multiple positions. (default: unchecked)
  2. The Disallow discordant mappings parameter will discard all mappings where the two mates map uniquely but with unexpected orientation, or where the distance between the two mapped mates differs from an internally estimated fragment length, including mates mapping to different chromosomes. (default: unchecked)
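
For illustration, a hypothetical Bowtie2 invocation combining some of these settings (index and file names are placeholders) might look like:

```shell
# -k 3: search for up to 3 mappings per read (instead of only the best)
# --no-mixed: disallow unique mappings of one mate on its own
# --no-discordant: discard discordant pairs
bowtie2 -x reference_index -1 reads_1.fastq -2 reads_2.fastq \
    -k 3 --no-mixed --no-discordant -S mapped.sam
```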

The application is based on the Bowtie2 aligner.

Variant calling with SAMtools and BCFtools

Action: to identify genomic variants. The application accepts Mapped Reads files to call variants. You can perform variant calling for each Mapped Reads file separately, or run the Variant Calling application on multiple mapped reads samples at once. The latter option may be helpful because taking the reads from several samples into consideration increases the accuracy of the analysis and reduces the probability of calling sequencing errors. After the variants are detected, you can annotate them by running the Effect Prediction application and/or use Genome Browser and Variant Explorer to explore the results.

The application uses samtools mpileup which automatically scans every position supported by an aligned read, computes all the possible genotypes supported by these reads, and then calculates the probability that each of these genotypes is truly present in your sample.

As an example, let’s consider the first 1000 bases in a Reference Genome file. Suppose position 35 (G in the reference) is covered by 27 reads with a G base and two reads with a T, for a total read depth of 29. In this case, the application concludes with high probability that the sample has genotype G, and that the T reads are likely due to sequencing errors. In contrast, if position 400 in the reference genome is T, but it is covered by 2 reads with a C base and 66 reads with a G (total read depth of 68), the sample most likely has genotype G.

Then the application executes bcftools call, which uses the genotype likelihoods generated in the previous step to call and filter genetic variants, and outputs all identified variants in a Genetic Variations file.
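
Assuming the classic SAMtools/BCFtools interface the text describes (newer releases move mpileup into bcftools), the two-step pipeline can be sketched as:

```shell
# Step 1: samtools mpileup computes genotype likelihoods at every
#         position covered by aligned reads (-u: uncompressed BCF output)
# Step 2: bcftools call applies the multiallelic caller (-m) and reports
#         variant sites only (-v), writing a VCF (-O v)
samtools mpileup -u -f reference.fa mapped.bam \
    | bcftools call -m -v -O v -o variants.vcf
```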

Let’s now look at the command line options more closely:

  1. Variants to report option. The application can call “SNPs and INDELs”, “SNPs only” or “INDELs only”. (default: “SNPs and INDELs”)
  2. Call only multi-allelic variants option. The multiallelic calling is recommended for most tasks. (default: checked)

Note

What is a multiallelic variant?

A multiallelic variant is a genomic variant with two or more observed alternate alleles at the variant locus. In contrast, a consensus (or biallelic) variant has a single non-reference allele: there are only two possible alleles at the variant site, the reference allele and the alternate one.

  3. Only report variant sites option. In some cases you may be interested in reporting only potential variant sites, excluding monomorphic ones (sites without alternate alleles). For this purpose, switch on the “Only report variant sites” option. (default: checked)
  4. The Discard anomalous read pairs option is used to skip anomalous read pairs in variant calling. (default: checked)
  5. The Maximum per-sample read depth to consider per position option sets the maximum number of reads at a position to consider. (default: 250)
  6. Minimum number of gapped reads for an INDEL candidate option. Typically, gapped alignments (like the ones produced by Unspliced Mapping with Bowtie2) can be used to identify indels (about 1-10 bases in length). The greatest indel sensitivity can be achieved by generating indel candidates from the mapped reads. (default: 1)
  7. Minimum per-sample depth to call non-variant block option. A non-variant block is a single record describing a segment of non-variant calls. Specify the minimum read depth you expect to observe among all sites encompassed by the non-variant block. (default: 1)
  8. Minimum variant quality option. The application will ignore variants with a quality score below this value. (default: 20)
  9. The Minimum average mapping quality for a variant parameter is used to discard all variants with an average mapping quality less than the specified value. (default: 20)
  10. The Minimum all-samples read depth for a variant is the minimum number of reads covering a position. (default: 1)
  11. The Chromosome to analyse option allows you to choose specific chromosomes to analyse. (default: All)
  12. Key to merge samples option. Specify a metainfo key if you would like to merge samples that share the same value for that key. This option can be useful for merging technical replicates.

Moreover, base alignment quality (BAQ) recalculation is turned on by default. It helps to rule out false positive SNP calls due to alignment artefacts near small indels.

Also, the application will always write DP (number of reads covering position), DV (number of high-quality variant reads), DP4 (number of forward reference, reverse reference, forward non-reference and reverse non-reference alleles used in variant calling) and SP (phred-scaled strand bias p-value) tags in the output file.

The resulting Genetic Variations file can be explored in Genome Browser as a separate variation track, further annotated using the Effect Prediction application, or viewed immediately with the Variant Explorer application.

This application is based on the SAMtools and BCFtools utilities and best used when performing Whole Exome Sequencing Analysis or Whole Genome Sequencing Analysis.

Effect prediction with SnpEff

Action: to annotate variants based on their genomic locations and calculate the effects they produce on known genes. The application accepts Genetic Variations and adds annotations for them.

The annotated variants can be further explored in Genome Browser, Variant Explorer or View Report applications.

In Genome Browser, the Variation track shows the genetic variants (SNPs, insertions etc.), their exact position on genome, average mapping quality and raw read depth.

../_images/gb_annotated_variants.png

If you would like to see the whole list of effects and annotations for the variants, as well as some general statistics (for example, the number of variants per chromosome, how many variants correspond to SNPs or insertions, and the number of effects by type and region), open the annotated Genetic Variations file in the View Report application. Read more about variant annotations and report statistics in the Effect annotation section of the Whole Exome Sequencing tutorial.

Use the Variant Explorer application to see what effect each individual variant produces, as well as to sort and filter the variants by various fields, such as mutation type, quality, locus, etc.

../_images/variant_explorer_annotated_variants.png

This application is based on the open-source SnpEff tool and best used in Whole Exome Sequencing and Whole Genome Sequencing analyses.

Variant explorer

Action: to interactively explore genetic variations such as SNPs, MNPs, and indels at specific genomic positions. The application not only displays the information about variants but also allows you to sort and filter by various fields, such as mutation type, quality, locus, etc.

../_images/variant_explorer_app_page.png

Variant Explorer takes as input a  Genetic Variations file which can be imported or generated with the Variant Calling application. If you open it in the application, you will see default DP (Raw read depth) and MQ (Average mapping quality) columns (“Other” tab in “Columns” section).

../_images/variant_explorer_other.png

Variants can be annotated with the Effect Prediction application that analyses genomic position of the variants and reveals the effects they produce on known genes (such as amino acid changes, synonymous and nonsynonymous mutations, etc.). For such variants the following information will be shown (find it in “Effect prediction” tab).

../_images/variant_explorer_effect_prediction_tab.png
  • Effect — effect predicted by SnpEff tool;
  • Impact — impact predicted by SnpEff tool;
  • Functional class — functional class of a region, annotated by SnpEff tool.

Moreover, the application calculates “Additional metrics” such as genotype frequencies for samples homozygous for the reference and alternate alleles (GF HOM REF and GF HOM ALT columns, respectively), read depth for samples homozygous for the alternate allele (DP HOM ALT) and read depth for heterozygous samples (DP HET).

../_images/variant_explorer_additional_metrics.png

Note

How many raw reads match to the reference and alternative alleles?

DP and DP4 fields may help.

DP is the raw read depth. DP4 refers to the numbers of reads covering the reference-forward, reference-reverse, alternate-forward and alternate-reverse bases. For example, DP4=0,0,1,2 means one read carries the alternate base on the forward strand, two reads carry it on the reverse strand, and no covering reads have the reference base at that position. The sum of DP4 will not always equal the DP value, because some reads are of too low quality.

Note

How can I find out the allele frequency for a variant?

Have a look at the allele frequency (RAF column), which is the fraction of reads supporting the alternate allele (that information is provided in the DP4 field). Our Variant Calling application is forced to fit the model of categorical allele frequencies, e.g. 0 (homozygous reference), ~0.5 (heterozygous, carrying one copy each of the reference and alternate alleles) or 1 (homozygous alternate).
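
As a toy illustration of that calculation, the alternate-allele fraction can be computed directly from the DP4 counts (using the DP4 value from the note above):

```shell
# DP4 = ref-forward, ref-reverse, alt-forward, alt-reverse
# fraction of alt-supporting reads = (alt_fwd + alt_rev) / total
echo "DP4=0,0,1,2" | awk -F'[=,]' '{ print ($4 + $5) / ($2 + $3 + $4 + $5) }'
# prints 1 (every covering read supports the alternate allele)
```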

To change the default columns or add more columns, choose them in the corresponding tabs in “Columns” section and “Save” your changes. After that all selected columns will be displayed in Table viewer.

You can “download filtered data as .tsv” or create a new file with filtered variants.

Read more about this application in our tutorials on Whole Exome Sequencing and Whole Genome Sequencing analyses.

Intersect genomic features

Action: to perform an intersection between several feature files, such as Mapped Reads files or Genetic Variations files. Depending on the input files, the application generates different outputs, either Mapped Reads or Genetic Variations files.

Let’s look at the command line options:

  1. Rule for filtering option. The application can “Report overlapping features”. For example, you could isolate single nucleotide polymorphisms (SNPs) that overlap with SNPs from another file by intersecting two Genetic Variations files. If you would instead like to know which features do not overlap with the others, use the “Report non-overlapping features” filter. (default: Report overlapping features)
  2. The Minimum overlapping fraction option allows you to check whether a feature of interest has at least the specified fraction of its length overlapping another feature. (default: 10)
  3. The Rule for overlap strandedness option allows you to ignore overlaps on the same strand (“Discard overlaps on the same strand”), on the other strand (“Discard overlaps on the other strand”) or expect overlapping without respect to the strandedness (“None”). (default: None)
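
As a sketch of the underlying BEDtools calls (file names are placeholders; the mapping of flags to the options above is our assumption):

```shell
# Report features from a.vcf that overlap features in b.vcf
# (-u reports each overlapping feature once)
bedtools intersect -a a.vcf -b b.vcf -u > overlapping.vcf

# Report non-overlapping features instead (-v), requiring at least
# 10% of a feature's length to overlap (-f 0.10);
# -s / -S restrict overlaps by strandedness
bedtools intersect -a a.vcf -b b.vcf -v -f 0.10 > non_overlapping.vcf
```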

This application is based on the BEDtools.

Bisulfite sequencing data analysis

Bisulfite sequencing mapping with BSMAP

Action: to map high-throughput bisulfite sequencing reads at the level of the whole genome.

Let’s talk a bit about various settings:

  1. The Number of mismatches option lets you set the maximum number of allowed mismatches per read. Increasing this value raises both the percentage of mapped reads and the application runtime. By default, a read can be mapped to the genome with no more than 5 mismatches. (default: 5)
  2. Rule for multiple mappings option. The application can “only report unique hits” for a mappable read or, if your reads map to multiple positions in the genome, “report 1 random “best” mapping”. In the latter case, duplicated genome regions are not omitted from the analysis altogether. (default: Report 1 random “best” mapping)
  3. The BS data generation protocol option enables you to specify what library preparation method was used to construct the bisulfite converted library. (default: Lister)

If the “Lister” protocol was used, your reads will be mapped to the two forward strands. You can read more about this protocol in Lister et al. If you choose the “Cokus” protocol, the application will align your reads to all four strands. You can find more details about this protocol in the original study by Cokus et al.
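
A minimal BSMAP invocation reflecting these settings might look like the following sketch (file names are placeholders, and the exact flag names are assumptions based on BSMAP’s documentation):

```shell
# -a: input reads        -d: reference genome       -o: output mappings
# -v 5: allow up to 5 mismatches per read
# -r 1: report 1 random "best" mapping for multireads (-r 0: unique hits only)
# -n 0: Lister protocol (two forward strands);
# -n 1: Cokus protocol (all four strands)
bsmap -a reads.fastq -d genome.fa -o mapped.sam -v 5 -r 1 -n 0
```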

The application is based on the BSMAP aligner and it is used in the Whole-Genome Bisulfite Sequencing Analysis tutorial.

Reduced representation bisulfite sequencing mapping with BSMAP

Action: to map reduced representation bisulfite sequencing (RRBS) reads to the specific digestion sites on the genome.

Let’s talk a bit about various settings:

  1. The Enzyme sequence option is important. It specifies the sequence recognised by the restriction enzyme used to digest genomic DNA during library preparation. By default, the application uses the C-CGG sequence, which is recognised by the MspI restriction enzyme. (default: “C-CGG”)
  2. The Number of mismatches option lets you set the maximum number of allowed mismatches per read. Decreasing this number reduces both the application runtime and the percentage of mapped reads. (default: 5)
  3. Rule for multiple mappings option. The application can “only report unique hits” for a mappable read or, if your reads map to multiple positions in the genome, “report 1 random “best” mapping”. In the latter case, duplicated genome regions are not omitted from the analysis altogether. (default: Report 1 random “best” mapping)
  4. The BS data generation protocol option enables you to specify what library preparation method was used to construct the bisulfite converted library. (default: Lister)

If the “Lister” protocol was used, your reads will be mapped to the two forward strands. You can read more about this protocol in Lister et al. If you choose the “Cokus” protocol, the application will align your reads to all four strands. You can find more details about this protocol in the original study by Cokus et al.

The application is based on the BSMAP aligner.

Methylation ratio analysis

Action: to determine the percent methylation at each ‘C’ base in mapped reads. Next, you can view methylation ratios in Genome Browser.

Command line options are the following:

  1. The Minimum coverage option allows you to filter results by depth of coverage. Raising it to a higher value (e.g. 5) requires that at least five reads cover a position for it to be reported. (default: not set)
  2. Trim N end-repairing fill-in bases option. For paired-end mappings, you can trim from 1 to 240 fill-in nucleotides introduced during DNA fragment end-repairing. For RRBS mappings, the number of fill-in bases can be determined by the distance between the cutting sites on the forward and reverse strands. If you analyse WGBS mappings, it is recommended to set this number between 0 and 3. (default: not set)
  3. The Report loci with zero methylation ratios option is used to report positions with zero methylation. (default: unchecked)
  4. The Combine ratios on both strands option allows you to combine the CpG methylation ratios from both strands. (default: unchecked)
  5. The Only unique mappings parameter, when checked, processes only unique mappings. (default: checked)
  6. The Discard duplicated reads option is used to remove duplicates from mapped reads. (default: checked)
  7. C/T SNPs filtering option. To ignore positions where a possible C/T SNP is detected, choose the “skip” value. To correct the methylation ratio according to C/T SNP information estimated from the G/A counts on the reverse strand, set the “correct” value. Otherwise, choose “no-action” to leave C/T SNPs unconsidered. (default: no-action)

If you analyse paired reads one more option appears:

  1. The Discard discordant mappings parameter is used to discard all mappings where the two mates map uniquely but with unexpected orientation, or where the distance between the two mapped mates differs from an internally estimated fragment length, including mates mapping to different chromosomes.
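
A hedged sketch of the underlying methratio.py call (flag names are assumptions based on the script’s documentation; file names are placeholders):

```shell
# -d: reference genome      -o: output table of per-C methylation ratios
# -u: only unique mappings  -r: discard duplicated reads
# -z: report loci with zero methylation ratios
# -g: combine CpG methylation ratios on both strands
# -m 5: minimum coverage    -i no-action: do not consider C/T SNPs
python methratio.py -d genome.fa -o ratios.txt -u -r -z -g -m 5 \
    -i no-action mapped.sam
```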

The outputs from Methylation Analysis application can be represented in the Genome Browser as Methylation ratios track.

../_images/methratio_in_gb.png

Note

What does the 0-1000 side bar represent?

These bars represent the final methylation frequency. To understand this, take a simple example. Let’s imagine we investigate position 30 on chromosome X. This position has 10 reads contributing to the methylation frequency: 7 of these 10 reads report Cs at this position (i.e. methylated Cs, where no bisulfite conversion occurred and the Cs were not transformed into Ts) and 3 reads show Ts (unmethylated Cs, where bisulfite conversion took place). The final methylation frequency is then calculated as 7/10 = 0.7. This value is multiplied by 1000 to give 700 (this is the bar height you see in Genome Browser). So side bars with a value of 0 represent unmethylated positions, while side bars at 1000 show maximum methylation (all reads have methylated Cs in that case).

The Methylation Analysis application is based on the methratio.py script and it is used in the Whole-Genome Bisulfite Sequencing Analysis tutorial.

Microbiome data analysis

Microbiome analysis with QIIME

Action: to identify microbial species and their abundances in microbiome samples. The application accepts microbial sequencing reads and outputs Clinical or Research reports with abundance plots and microbiological diversity metrics.

The microbiome analysis can use either Greengenes (for bacteria) or UNITE (for fungi) reference databases to estimate the taxonomic composition of the microbial communities.

Let’s review the application options:

  1. OTU picking option. To pick OTUs (Operational Taxonomic Units), the application provides two protocols (default: open-reference):
  • closed-reference: reads are clustered against a reference sequence collection and any reads which do not hit a sequence in the reference sequence collection are excluded from downstream analyses.
  • open-reference: reads are clustered against a reference sequence collection and any reads which do not hit the reference sequence collection are subsequently clustered de novo (i.e. against one another without any external reference).
  2. Algorithm used for clustering. With the open-reference protocol, the application suggests the uclust (default) or sortmerna_sumaclust algorithms. If you prefer the closed-reference protocol, choose between the blast (default), uclust_ref and sortmerna algorithms.
  3. The Quality filter for pre-clustering step option will remove any low quality or ambiguous reads before clustering. (default: 0)
  4. The Join paired-end reads (for paired reads only) option will join paired-end reads before the clustering. (default: unchecked)

The next two options are available only for open-reference protocol:

  1. Taxonomy assignment will be performed using the blast, rdp, rtax, mothur, uclust or sortmerna algorithm. In the case of the closed-reference method, taxonomy assignment is always performed by the uclust algorithm. (default: blast)
  2. The Percent of reads to cluster de novo option applies to reads that do not hit the reference database and will be clustered de novo. (default: 0.001)
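
Assuming the QIIME 1 script interface, open-reference OTU picking with these options might be invoked roughly as follows (file names are placeholders):

```shell
# Open-reference OTU picking against a reference sequence collection
# -m: clustering algorithm (e.g. uclust or sortmerna_sumaclust)
# -s 0.001: fraction of reads failing reference clustering that are
#           subsampled for de novo clustering
pick_open_reference_otus.py -i seqs.fna -r reference_otus.fasta -o otus/ \
    -m uclust -s 0.001
```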

Output reports include the following metrics:

counts for every taxonomic unit (how many reads match a given group) in the form of an interactive plot:

../_images/microbime_analysis_counts.png

and a table:

../_images/microbiome_analysis_table.png

alpha diversity (richness within each sample, e.g. the number of taxa identified):

../_images/microbiome_analysis_alpha_diversity.png

beta diversity (the difference between pairs of samples, i.e. the heterogeneity of samples):

../_images/microbiome_analysis_beta_diversity.png

The application is based on the open-source tool QIIME.

Additional visualisation applications

This section includes the applications that can be used in various pipelines to view the content of the data (e.g. Sequencing Assay Viewer) or to display multiple data types on different steps of analyses (e.g. Genome Browser).

Sequencing assay viewer

Action: to show the content of a Raw Reads file and to look for specific nucleotide sequences, which can be exact, reverse, complement or reverse-complement matches to the sequence of interest.

../_images/sequencing_assay_viewer.png

To access this application, select the assay you are interested in, right click on it and from the “Explore” section select the application.

Genome browser

Action: to visualize different types of genomic data: mapped reads, genetic variants, methylation ratios and others.

../_images/gb_page.png

There are several tracks that can be visualized in Genome Browser:

  • Reference track displays reference genome, its genes (green boxes), transcripts, and their coordinates;
../_images/gb_reference_track.png
  • Coverage track represents the sequencing reads coverage for mapped reads
../_images/gb_coverage_track.png
  • Variation track shows genetic variants (SNPs, insertions etc.), their exact position on the genome, average mapping quality and raw read depth;
../_images/gb_variation_track.png
  • Methylation ratios track reflects the proportion of methylated and unmethylated cytosine residues.
../_images/gb_methylation_ratios_track.png

You can also manage tracks: add new ones, hide or delete them. When working with multiple tracks, you can combine the tracks mentioned above into a Combined track or a Formula track. On a combined track, several tracks are superimposed and shown together, making it easy to compare coverage across samples.

../_images/gb_combined_track.png

Or you can apply basic mathematical operations and create formulas based on your genomic data, for example, computing the average of the values from different samples. The results of these computations are shown on the formula track.

Moreover, each track can be customised by changing its properties (track color, normalized values, show only SNPs, maximum and minimum values to be shown on a track, etc.). Use “Edit” button to change properties for multiple tracks at once.

Genome Browser allows you to browse either a specific genomic position (search by coordinates) or a specific feature (search by feature name). You can navigate through the data to find a feature of interest or explore the regions surrounding it, and zoom in to nucleotide resolution. A found feature can be marked with sticky notes (Shift + click on the position on the track). When you share the Genome Browser page with your collaborators, sticky notes will help to focus their attention on your findings.

You can see the Genome browser in action in this blog post.

Reference genomes

One way or another, many bioinformatics analysis pipelines rely on the use of a reference genome. For instance, we use reference genomes in DNA methylation analysis, in differential gene expression analysis, and in the analysis of transcriptomic heterogeneity within populations of cells. The choice of a reference genome can increase the quality and accuracy of the downstream analysis, or it can have a harmful effect on it. For example, it has been shown that the choice of gene annotation has a big impact not only on RNA-seq data analysis but also on variant effect prediction.

On Genestack, you can find several reference genomes for some of the most common model organisms. We are adding more and more reference genomes of model organisms to this list regularly.

For some organisms we provide several genomes, e.g. there are a couple of reference genomes for Homo sapiens.

../_images/public_reference_genomes.png

What are the differences between these reference genomes? And how do you choose the correct one? The answer is not so straightforward and depends on several factors – let’s discuss each of them:

  1. Reference genome assembly and release version

For instance: “Homo sapiens / GRCh37 release 75” vs “Homo sapiens / GRCh38 release 86”.

The numbers correspond to versions (or “builds”) of the reference genome – the higher the number, the more recent the version. We generally recommend using the latest version possible. One thing to remember is that for the newest genome builds, resources such as genome annotations and functional information are likely to be limited, as it takes time for Ensembl/UCSC to integrate additional genomic data with a new build. You can read more about this in a post from the Genome Spot blog and in this article from Bio-IT.

  2. One organism – many strains

K12 and O103 are two different strains of E.coli. K12 is an innocuous strain commonly used in various labs around the world. O103 is a pathogenic strain, commonly isolated from human cases in Europe. Depending on your experiment, you should choose a matching reference genome.

  3. Toplevel sequence or primary assembly
  • Toplevel reference genomes contain all chromosomes, sequence regions not assembled into chromosomes and padded haplotype/patch regions.
  • Primary assembly genomes contain all toplevel sequence regions, excluding haplotypes and patches.

We strongly recommend using primary assembly reference genomes, since they are best for performing sequence similarity searches, while patches and haplotypes would confuse the analysis.

  4. DNA or cDNA
  • DNA reference genomes contain the genomic DNA sequence;
  • cDNA reference genomes consist of all transcript sequences for known and putative genes, including pseudogenes.
  5. Masked, soft-masked and unmasked genomes

There are three types of Ensembl reference genomes: unmasked, soft-masked and masked.

Masking is used to detect and conceal interspersed repeats and low complexity DNA regions so that they could be processed properly by alignment tools. Masking can be performed by special tools, like RepeatMasker. The tool goes through DNA sequence looking for repeats and low-complexity regions.

There are two types of masked reference genomes: masked and soft-masked.

  • Masked reference genomes are also known as hard-masked DNA sequences. Repetitive and low complexity DNA regions are detected and replaced with ‘N’s. The use of a masked genome may adversely affect analysis results, leading to wrong read mapping and incorrect variant calls.

Note

When should you use a masked genome?

We generally do not recommend using a masked genome, as it entails a loss of information (after mapping, some “unique” sequences may not be truly unique) and does not guarantee 100% accuracy and sensitivity (masking can never be absolutely perfect). Moreover, it can lead to an increased number of falsely mapped reads.

  • In soft-masked reference genomes, repeats and low complexity regions are also detected, but in this case they are masked by converting the bases to lowercase (e.g. acgt).

Note

When should you use a soft-masked genome?

The soft-masked sequence does contain repeats, indicated by lowercase letters, so the use of a soft-masked reference could improve the quality of the mapping without detriment to sensitivity. It should be noted, however, that most alignment tools do not take soft-masked regions into account; for example, BWA, TopHat and Bowtie2 always use all bases in alignment, whether they are lowercase or not. That is why there is no actual benefit from using a soft-masked genome in comparison with an unmasked one.

  • We recommend you use unmasked genomes when you do not want to lose any information. If you want to perform some sort of filtering, it is better to do so  after the mapping step.

Usually, reference genome name includes information about all these factors: organism, genome assembly, release, primary assembly/toplevel, masking procedure and molecule.

Example:

To perform Whole Exome Sequencing analysis, we recommend you use an unmasked reference genome of the latest releases and assemblies (e.g. Homo sapiens / GRCh38 release 85 (primary assembly, unmasked, DNA) for human samples).

The bioinformatics community is divided on the topic of the use of reference genomes. It is our personal opinion that it is best to always use unmasked genome and perform filtering after the mapping step. However, if you would like to read more on the topic, we suggest taking a look at the following papers:

  1. McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P. Choice of transcripts and software has a large effect on variant annotation. Genome Med. 2014;6(3):26. DOI: 10.1186/gm543;
  2. Frankish A, Uszczynska B, Ritchie GR, Gonzalez JM, Pervouchine D, Petryszak R, et al. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics. 2015;16 (Suppl 8):S2. DOI: 10.1186/1471-2164-16-S8-S2.

Microarray data

Scientists are using DNA microarrays to quantify gene expression levels on a large scale or to genotype multiple regions of a genome.

Note

What is a DNA microarray?

It is a collection of microscopic DNA spots attached to a solid surface. Each spot contains multiple identical DNA sequences (known as probes or oligos) and represents a gene or other DNA element that are used to hybridize a cDNA or cRNA sample (called target) under high-stringency conditions. Probe-target hybridization is measured by detection of targets labeled with a molecular marker of either radioactive or fluorescent molecules.

Expression arrays

Microarrays are useful in a wide variety of studies with a wide variety of objectives. In this section we will look at expression microarrays.

Microarray Explorer

Action: to collaboratively produce and explore complete Microarray reports, from quality control and dose response analysis to differential expression analysis.

The application is based on the following Genestack applications that you can also freely use separately:

Microarrays normalisation

When investigating differential gene expression using microarrays, it is often the case that the expression levels of genes that should not change given different conditions (e.g. housekeeping genes) report an expression ratio other than 1. This can be caused by a variety of reasons, for instance: variation caused by differential labelling efficiency of the two fluorescent dyes used or different amounts of starting mRNA. You can read more about this in this document.

Normalisation is a process that eliminates such variations in order to allow users to observe the actual biological differences in gene expression levels. On Genestack, we have four different Microarray Normalisation applications - one for each of the four commonly used chips: Affymetrix, Agilent, L1000 and GenePix.

Affymetrix microarrays normalisation

Action: to perform normalisation of Affymetrix microarray data.

To normalize Affymetrix microarrays the application uses the Robust Multi-array Average (RMA) method. First, the raw intensity values are background corrected, log2 transformed and then quantile normalized. Next, a linear model is fitted to the normalized data to obtain an expression measure for each probe set on each array. For more on RMA, see this paper.

The normalised data can be assessed with the Microarray quality control application, enabling you to detect potential outliers and, if necessary, remove them from downstream analysis. Good quality data can be further processed with the Dose response analysis or Test differential expression for microarrays applications.

The application is based on the affy R package.

Agilent microarrays normalisation

Action: to perform normalisation of Agilent microarray assays.

For 1-channel Agilent microarrays, various procedures for background correction (e.g. “subtract”, “half”, “minimum”, “normexp”), and between-array normalisation (e.g. “quantile”, “scale”), can be applied.

For 2-channel Agilent microarrays, procedures for within-array normalisation (e.g. “loess”, “median”) can also be applied.

Note

What is the difference between 1-channel and 2-channel microarrays?

Two-channel (or two-color) microarrays are typically hybridized with cDNA prepared from two samples (or two experimental conditions) that the scientists want to compare, e.g. diseased tissue vs. healthy tissue. The two samples are labeled with different fluorophores, say Cy5 and Cy3 dyes, and emit signals of different intensity. Relative intensities of each fluorophore may then be used in ratio-based analysis to identify up-regulated and down-regulated genes.

In single-channel arrays, also called one-color microarrays, each experimental condition must be applied to a separate chip. They estimate the absolute levels of gene expression, and only a single dye is used.

The normalised data can be assessed with the Microarray quality control application, enabling you to detect potential outliers and, if necessary, remove them from downstream analysis. Good quality data can be further processed with the Dose response analysis or Test differential expression for microarrays applications.

The application is based on the limma R package.

GenePix microarrays normalisation

Action: to perform normalisation of GenePix microarray assays.

For GenePix microarrays, quantile between-array normalisation is performed and various procedures for background correction (e.g. “subtract”, “half”, “minimum”, “normexp”) can be applied.

The normalised data can be assessed with the Microarray quality control application, enabling you to detect potential outliers and, if necessary, remove them from downstream analysis. Good quality data can be further processed with the Dose response analysis or Test differential expression for microarrays applications.

L1000 microarrays normalisation

Action: to perform normalisation of L1000 microarray assays.

To normalize L1000 microarrays, the application uses the “quantile” method for between-array normalisation.

The normalised data can be assessed with the Microarray quality control application, enabling you to detect potential outliers and, if necessary, remove them from downstream analysis. Good quality data can be further processed with the Dose response analysis or Test differential expression for microarrays applications.

Microarray quality control

As in any statistical analysis, the quality of the data must be checked. The goal of this step is to determine if the whole process has worked well enough so that the data can be considered reliable.

Action: to perform quality assessment of normalised microarrays and detect potential outliers.

The application generates a report containing quality metrics based on between-array comparisons, array intensity, variance-mean dependence and individual array quality. Some metrics have their own labels, which help you understand by which metric(s) a particular microarray is considered an outlier.

../_images/microarray_gc_report.png

QC metrics are computed for both the unnormalised and normalised microarrays and include:

  1. Between array comparison metrics.
  • Principal Component Analysis (PCA) is a dimension reduction and visualisation technique that is used to project the multivariate data vector of each array into a two-dimensional plot, such that the spatial arrangement of the points in the plot reflects the overall data (dis)similarity between the arrays.

    For example, in the picture below, PCA identifies variance in datasets, which can come from real differences between samples, or, as in our case, from the failed “CD4 T lymphocytes, blood draw (1)” array.

../_images/microarray_qc_pca.png
  • Distances between arrays. The application computes the distances between arrays. The distance between two arrays is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering).

    An array is flagged as an outlier if the sum of its distances to all other arrays is exceptionally large.

../_images/microarrays_qc_distances_between_arrays.png
  2. Array intensity statistics.
  • Boxplots of signal intensities represent the signal intensity distributions of the microarrays. Typically, we expect the boxes to be similar in position and width. If they are different, it may indicate an experimental problem.
../_images/microarray_qc_boxplots_of_signal_intensities.png
  • Density plots of signal intensities show the density distributions for the microarrays. In a typical experiment, we expect these distributions to have similar shapes and ranges. Differences in density distributions can indicate quality-related problems.
../_images/microarray_qc_density_plots_of_signal_intensities.png
  3. Variance-mean dependence metric.
  • “Standard deviation versus mean rank” plot is a density plot of the standard deviation of the intensities across arrays on the y-axis versus the rank of their mean on the x-axis. The red dots, connected by lines, show the running median of the standard deviation.

    After the normalisation procedure, we typically expect the red line to be almost horizontal. A hump on the right-hand side of the line may indicate a saturation of the intensities.

../_images/microarray_qc_standard_deviation_vs_mean_rank.png
  4. Individual array quality.
  • MA plots allow pairwise comparison of the log-intensity of each array to a “pseudo”-array (which consists of the median across arrays) and identification of intensity-dependent biases. The y-axis of the plot shows the log-ratio intensity of one array to the median array, called “M”, while the x-axis shows the average log-intensity of both arrays, called “A”. Typically, probe levels are not likely to differ a lot, so we expect an MA plot centered on the Y=0 axis from low to high intensities.
../_images/microarray_qc_MA_plot.png
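Two of the metrics above, the between-array L1-distance and the per-array M and A values, can be sketched as follows. This is a minimal numpy illustration, not the ArrayQualityMetrics implementation:

```python
import numpy as np

def l1_distances(data):
    """Mean absolute difference (L1-distance) between every pair of
    arrays; `data` is a (probes x arrays) matrix of intensities."""
    n = data.shape[1]
    return np.array([[np.mean(np.abs(data[:, i] - data[:, j]))
                      for j in range(n)] for i in range(n)])

def ma_values(log_array, log_pseudo):
    """M (log-ratio) and A (average log-intensity) of one array
    against the probe-wise median 'pseudo'-array."""
    return log_array - log_pseudo, (log_array + log_pseudo) / 2.0

# Toy data: three well-behaved arrays and one shifted outlier.
rng = np.random.default_rng(0)
base = rng.normal(8.0, 1.0, size=(200, 1))
data = np.hstack([base + rng.normal(0, 0.1, (200, 3)), base + 4.0])

dist = l1_distances(data)
total = dist.sum(axis=1)              # the outlier has the largest total
pseudo = np.median(data, axis=1)
m, a = ma_values(data[:, 0], pseudo)  # M of a good array hovers near 0
```

The shifted fourth array dominates the summed distances, which is exactly the signal the "Distances between arrays" heatmap and the outlier detection rely on.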

Additional Affymetrix-specific metrics are also computed for Affymetrix microarrays.

Overall, if you click on “Outlier detection overview”, the application will detect apparent outlier arrays and suggest that you remove them and re-normalise your data before continuing with differential expression or dose response analyses.

../_images/microarray_gc_report_outlier.png

The application is based on the ArrayQualityMetrics R package.

Differential gene expression for microarrays

Expression microarrays can simultaneously measure the expression level of thousands of genes between sample groups. For example, to understand the effect of a drug we may ask which genes are up-regulated or down-regulated between treatment and control groups, i.e. to perform differential expression analysis.

Once your microarray samples have been normalised, you can use them as inputs for differential expression analysis.

Test differential expression for microarrays

Action: to perform differential expression analysis between groups of microarray samples.

The application requires normalized microarrays to calculate differential expression statistics (such as log-expression, log-fold change, p-value and FDR) and a microarray annotation to map probe identifiers to gene symbols.

Let’s look at the options:

  1. Group samples by an experimental factor or condition that was specified in the metainfo of the samples. For example, if you have six samples — three of them treated with compound X, and the remaining three untreated — the grouping factor will be the treatment procedure. If no grouping factor is available here, you should open your microarray assays in the Metainfo editor and specify a grouping factor in a new column.
  2. Control group option. If you specify a control group, each group will be compared separately to that control group. If you do not specify a control group, each group will be compared against the average of all the other groups.

If you specify one or more confounding factors, you can identify differentially expressed genes between the tested groups of samples, while taking into account potential confounders, such as sex, age, laboratory, etc. In this case the detected gene expression changes are only caused by the factor of interest, for example treatment, while any possible effects of confounding factors are excluded. Specify confounding factors by choosing the corresponding metainfo keys.

Currently, only single-factor comparisons are supported. More complex experimental designs (batch effects, multi-factor analysis, etc.) will be supported in later versions of the application.

When the analysis is done, you can explore the results in Expression navigator.

Expression navigator

Action: to view and filter the results of differential gene expression analysis.

../_images/en_microarrays_app_page.png

The Expression Navigator page contains four sections:

  1. Groups Information section. It is a summary of the groups available for comparison. Size refers to the number of samples used to generate each group.
  2. The Top Differentially Expressed Genes section allows you to choose which groups to compare and how to filter and sort identified differentially expressed genes.
../_images/en_microarrays_DE_genes_table.png

You can filter differentially expressed genes by maximum acceptable false discovery rate (FDR), up or down regulation, minimum log fold change (LogFC), and minimum log counts per million (LogCPM).

../_images/en_microarrays_filtering.png

Let’s look through these statistics:

  • log-fold change: the fold-change in expression of a gene between two groups A and B is the average expression of the gene in group A divided by the average expression of the gene in group B. The log-fold change is obtained by taking the logarithm of the fold-change in base 2.
  • log-expression: log-transformed and normalised measure of gene expression.
  • p-value. The application also computes a p-value for each gene. A low p-value (typically, ≤ 0.05) is viewed as evidence that the null hypothesis can be rejected (i.e. the gene is differentially expressed). However, due to the fact that we perform multiple testing, the value that should be looked at to safely assess significance is the false discovery rate.
  • False discovery rate. The FDR is a corrected version of the p-value, which accounts for multiple testing correction. Typically, an FDR < 0.05 is good evidence that the gene is differentially expressed.
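These statistics can be illustrated with a minimal sketch: ordinary Welch t-tests plus Benjamini-Hochberg adjustment. The application itself uses limma's moderated statistics, so this is an assumption-laden simplification:

```python
import numpy as np
from scipy import stats

def de_statistics(group_a, group_b):
    """Per-gene log2 fold change, p-values and Benjamini-Hochberg FDR.
    Inputs are (genes x samples) matrices of log2 expression values."""
    lfc = group_a.mean(axis=1) - group_b.mean(axis=1)   # log2(A) - log2(B)
    _, pvals = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)
    # Benjamini-Hochberg: adjusted p = p * n / rank, made monotone.
    n = len(pvals)
    order = np.argsort(pvals)
    adj = pvals[order] * n / np.arange(1, n + 1)
    adj = np.minimum(np.minimum.accumulate(adj[::-1])[::-1], 1.0)
    fdr = np.empty(n)
    fdr[order] = adj
    return lfc, pvals, fdr

# Six genes, four samples per group; gene 0 is strongly up-regulated.
rng = np.random.default_rng(1)
a = rng.normal(5.0, 0.1, size=(6, 4))
b = rng.normal(5.0, 0.1, size=(6, 4))
a[0] += 3.0
lfc, pvals, fdr = de_statistics(a, b)
```

Note how the FDR for each gene is its p-value inflated by the number of tests relative to its rank; this is why filtering on FDR rather than raw p-value is the safer significance criterion.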

Moreover, you can sort the differentially expressed genes by these statistics by clicking the small arrows next to the name of the metric in the table.

../_images/en_microarrays_sorting.png

The buttons at the bottom of the section allow you to refresh the list based on your filtering criteria or clear your selection.

  3. The top right section contains a boxplot of expression levels. Each colour corresponds to a gene. Each boxplot corresponds to the distribution of a gene’s expression levels in a group, and coloured circles represent the expression value of a specific gene in a specific sample.
../_images/en_microarrays_boxplots.png
  4. The bottom-right section contains a search box that allows you to look for specific genes of interest. You can look up genes by gene symbol, with autocomplete. You can search for any gene (not only those that are visible with the current filters).
../_images/en_microarrays_search_genes.png

Compound dose response analysis

Dose response analyser

Action: to find the compound dosages at which genes start to show significant expression changes, i.e. the Benchmark Doses (BMDs). Genes that are differentially expressed between doses are then analysed for pathway enrichment; the application reports affected pathways and the average BMD of the genes involved in each.

This application takes as input normalised microarray data and performs dose response analysis. It requires a microarray annotation file to map probe identifiers to gene symbols (you can upload your own or use a publicly available one). It also requires a pathway annotation file to perform pathway enrichment analysis. Pathway files from Wikipathways are pre-loaded in the system.

The first step of the analysis is to identify genes that are significantly differentially expressed across doses. Once these are detected, multiple dose response models are fitted to each significant gene and statistics about the fits are recorded.

The following options can be configured in the application:

  1. The FDR filter for differentially expressed genes specifies the False Discovery Rate threshold above which genes should be discarded from the differential expression analysis (default: FDR < 0.1)
  2. Metainfo key for dose value. This option specifies the metainfo key storing the dose levels corresponding to each sample, as a numeric value. If no such attribute is present in your data, you need to open your microarray assays in the Metainfo editor and add it there.

For each gene, the application fits different regression models (linear, quadratic and Emax) to describe the expression profiles of the significant genes as a function of the dose. Then, an optimal model is suggested based on the Akaike Information Criterion (AIC), which reflects the compromise between the complexity of a model and its “goodness-of-fit”. The model with the lowest AIC is considered “the best”; however, a difference in AIC of less than 2 is not significant, so models whose AIC is within 2 of the best should also be considered good.

Mathematically AIC can be defined as:

AIC = 2(k - ln(L)),

where k is the number of parameters of the model, and ln(L) is the log-likelihood of the model.
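Under the Gaussian-error assumption of least-squares fitting, the AIC comparison can be sketched as follows (an illustration of the criterion, not the application's fitting code):

```python
import numpy as np

def aic(k, log_likelihood):
    """AIC = 2(k - ln L); the model with the LOWEST AIC is preferred."""
    return 2.0 * (k - log_likelihood)

def gaussian_log_likelihood(residuals):
    """Log-likelihood of a least-squares fit assuming Gaussian errors."""
    residuals = np.asarray(residuals)
    n = residuals.size
    sigma2 = np.mean(residuals ** 2)
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)

# Fit constant, linear and quadratic models to a noisy linear response.
rng = np.random.default_rng(0)
dose = np.linspace(0.0, 10.0, 20)
expr = 1.0 + 0.5 * dose + rng.normal(0.0, 0.2, dose.size)

aics = {}
for degree in (0, 1, 2):
    coeffs = np.polyfit(dose, expr, degree)
    resid = expr - np.polyval(coeffs, dose)
    aics[degree] = aic(degree + 1, gaussian_log_likelihood(resid))
# The underfitting constant model scores a much higher (worse) AIC than
# the linear one; linear and quadratic typically differ by less than 2.
```

The penalty term 2k is what stops the quadratic model from winning merely because its extra coefficient shaves a little off the residuals.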

In addition, for each gene the benchmark dose (BMD) is computed based on the optimal model.

Following the method described in the Benchmark Dose Software (BMDS) user manual, the benchmark dose can be estimated as follows:

|m(d)-m(0)| = 1.349σ₀,

where m(d) is the expected gene expression at dose d, and σ₀ is the standard deviation of the response at dose 0, which we approximate by the sample standard deviation of the model residuals.
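For a linear model m(d) = b0 + b1·d, this equation has the closed-form solution BMD = 1.349·σ₀/|b1|. A small sketch under that assumption (an illustration, not the application's fitting code):

```python
import numpy as np

def linear_bmd(dose, expr):
    """Benchmark dose for a linear fit m(d) = b0 + b1*d.
    |m(BMD) - m(0)| = 1.349*sigma  =>  BMD = 1.349*sigma / |b1|,
    where sigma is the sample standard deviation of the residuals."""
    b1, b0 = np.polyfit(dose, expr, 1)
    resid = expr - (b0 + b1 * dose)
    sigma = resid.std(ddof=2)  # two fitted coefficients
    return 1.349 * sigma / abs(b1)

# Synthetic dose response: slope 2, moderate noise.
rng = np.random.default_rng(0)
dose = np.linspace(0.0, 8.0, 9)
expr = 1.0 + 2.0 * dose + rng.normal(0.0, 0.5, dose.size)
bmd = linear_bmd(dose, expr)  # small relative to the dose range here
```

Intuitively, a steeper slope or a quieter baseline pushes the BMD towards zero: less dose is needed before the expression change clears the 1.349σ₀ noise band.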

Moreover, for each gene, a differential expression p-value is computed using family-wise F-test across all pairs of contrasts of the form “dose[n+1] - dose[n]” for n from 0 to <number of doses - 1>. Multiple testing correction is then performed and genes that pass the user-defined FDR threshold are categorised as differentially expressed.

Differentially expressed genes are then processed by gene set enrichment analysis. For each enriched pathway, the average and standard deviation of the BMDs of its genes are computed.

The application is based on the limma R package.

Dose response analysis viewer

Action: to display dose response curves and benchmark doses for differentially expressed genes and enriched pathways. Note that if no gene passed the FDR threshold specified in the dose response analysis application, the application will report the 1,000 genes with the smallest unadjusted p-values.

../_images/dose_response_analysis_report.png

Various regression models (linear, quadratic and Emax) are fitted for each identified differentially expressed gene to describe its expression profile as a function of the dose. These results are presented in an interactive table.

../_images/dose_response_analysis_table.png

The table includes information about:

  • PROBE ID – chip-specific identifier of the microarray probe;
  • GENE – the gene symbol corresponding to that probe (according to the microarray annotation file). Clicking on the gene name will show you a list of associated gene ontology (GO) terms;
../_images/dose_response_analysis_gene_ontology.png
  • BMD – the benchmark dose, i.e. the dose above which the gene shows a significant change in expression, according to the best-fitting of the 3 models used. It is calculated using the following formula:

    Let m(d) be the expected gene expression at dose d. The BMD then satisfies the following equation: | m(BMD)-m(0) | = 1.349*σ. In this formula, σ is the standard deviation of the response at dose 0, which we approximate by the sample standard deviation of the model residuals.

  • BEST MODEL – the model with the optimal Akaike Information Criterion (AIC) among the 3 models that were fitted for the gene; the AIC rewards models with small residuals and penalizes models with many coefficients, to avoid overfitting;

  • MEAN EXPR – average expression of the gene across all doses;

  • T – the moderated t-statistic computed by limma to test for differential expression of the gene;

  • P – unadjusted p-value testing for differential expression of the gene across doses;

  • FDR – false discovery rate (p-value, adjusted for multiple testing);

  • B – B statistic computed by limma to test for differential expression of the gene. Mathematically, this can be interpreted as the log-odds that the gene is differentially expressed.

Here are examples of dose response curves as they are displayed in the application:

../_images/dose_response_analysis_plot.png

In the “Pathways” tab, you can see a list of significantly enriched pathways, based on the detected differentially expressed genes and the pathway annotation file supplied to the analysis application.

../_images/dose_response_analysis_pathways.png

The table includes:

  • PATHWAY – pathway name, e.g. “Iron metabolism in placenta”;
  • SIZE – pathway size, i.e. how many genes are involved in the given pathway;
  • DE GENES – how many pathway genes are found to be differentially expressed in our data. Clicking on a specific pathway takes you to the “Genes” tab where you can get expression profiles and regression curves for the differentially expressed genes.
  • P – p-value;
  • FDR – false discovery rate value;
  • BMD – the pathway BMD is computed as the average of the BMDs of the significant genes involved in this pathway, computed with the model yielding the best AIC;
  • BMD SD – BMD standard deviation.

Methylation arrays

DNA methylation arrays are a widely-used tool to assess genome-wide DNA methylation.

Microarrays normalisation

Action: to perform normalisation of methylation microarray assays.

For methylation microarrays, normalisation can be performed with either “subsetQuantileWithinArray” or “quantile” method, and in addition, “genomeStudio” background correction may be applied.

../_images/meth-array-normalization-options.png

Further, the quality of normalised microarrays can be checked using the Microarray QC Report application to detect and remove potential outliers. Normalised microarrays that are of good quality may then be used in Differential methylation analysis and Differential regions methylation analysis.

The application is based on the minfi Bioconductor package.

Methylation array QC

Quality assessment of microarray data is a crucial step in the microarray analysis pipeline, as it allows us to detect and exclude low-quality assays from further analysis.

A single array with both red and green channels is used to estimate methylation for each individual sample. Then, for each CpG locus, both methylated and unmethylated signal intensities are measured.

Currently, there are two main measures used to estimate the DNA methylation level: the Beta-value and the M-value. The Beta-value is the ratio of the methylated probe intensity to the overall intensity (the sum of the methylated and unmethylated probe intensities) plus a constant (Du P. et al., 2010). Therefore, the Beta-value roughly reflects the percentage of methylation at a site.

../_images/qc-betaval.png

The M-value is the log2 ratio of the intensities of the methylated probe versus the unmethylated probe (Du P. et al., 2010):

../_images/qc-mval.png
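The two measures can be computed from the methylated (M) and unmethylated (U) signal intensities as follows. The offsets follow Du et al. (2010); this is a sketch, not minfi's implementation:

```python
import numpy as np

def beta_value(meth, unmeth, offset=100.0):
    """Beta = M / (M + U + offset); roughly the fraction of methylated
    signal (the offset stabilises low-intensity loci)."""
    return meth / (meth + unmeth + offset)

def m_value(meth, unmeth, offset=1.0):
    """M-value = log2((M + offset) / (U + offset))."""
    return np.log2((meth + offset) / (unmeth + offset))

# Three loci: mostly methylated, mostly unmethylated, half-and-half.
meth = np.array([9000.0, 500.0, 5000.0])
unmeth = np.array([900.0, 9000.0, 5000.0])
beta = beta_value(meth, unmeth)
mval = m_value(meth, unmeth)
# A methylated locus has Beta near 1 and a positive M-value;
# an unmethylated one has Beta near 0 and a negative M-value.
```

Beta-values are easy to interpret biologically, while M-values are closer to homoscedastic and so better suited to the statistical tests used downstream.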

Action: to assess quality of methylation microarray assays.

The Methylation array QC application allows the user to export files containing the methylated and unmethylated signal values, as well as the Beta-values, the M-values and the log median intensity values.

Additionally, you can download and explore “Copy number values” file with the sum of the methylated and unmethylated signals.

The Methylation array QC application provides various types of quality control plots. Let’s explore the QC report for the Infinium 450K microarrays:

1) Log median intensity plot

The scatterplot represents the log median of the signal intensities in both methylated and unmethylated channels for each array. In general, samples of good quality cluster together, while “bad” samples tend to separate, typically with lower median intensities.

../_images/log-median-intensities.png

2) Beta-values of the assays are represented by two plots:

  • Beta density plot represents the Beta-value densities of samples
../_images/qc-beta-density.png
  • Beta density bean plot also shows the methylation Beta-values.
../_images/qc-beta-density-bean.png

3) Control probes plots:

The Infinium 450K arrays have several internal control probes that help to track quality at different stages of assay preparation (based on Illumina’s Infinium® HD Assay Methylation Protocol Guide):

Sample-independent controls

Several sample-independent controls allow monitoring of different steps of microarray assay preparation and include:

  • Staining control strip, which estimates the efficiency of the staining step for both the red and green channels. These controls are independent of the hybridization and extension steps.
../_images/qc-staining.png
  • Extension control strip, which tests the efficiency of single-base extension of the probes that incorporates labeled nucleotides. Both the red (A and T, labeled with dinitrophenyl) and green (C and G, labeled with biotin) channels are considered.
../_images/qc-extension.png
  • Hybridization control strip, which estimates the overall performance of the microarray assay.

These controls use synthetic targets that are complementary to the array probe sequences. Extension of the target provides the signal. The higher the concentration of target used, the higher the signal intensity registered.

../_images/qc-hybridisation.png
  • Target removal control strip, which tests whether all targets are removed after the extension step. During the extension reaction, the sequence on the array is used as a template to extend the control oligos. The probe sequences, however, are not extendable. The signal is expected to be low in comparison to the signal of the hybridization controls.
../_images/qc-target-removal.png

Sample-dependent controls

A number of sample-dependent controls are provided to assess quality across samples.

  • Bisulfite-conversion controls

To estimate DNA methylation, the 450k assay probe preparation involves bisulfite conversion of DNA, in which all unmethylated cytosines are converted to uracils while methylated cytosines remain unchanged.

Bisulphite conversion I control strip

This control uses the Infinium I assay chemistry. There are two types of probes in this control: bisulphite-converted and bisulphite-unconverted ones. If the bisulphite conversion was successful, the converted probes match the converted DNA and are extended. If the sample contains some unconverted DNA, the unconverted probes get extended.

../_images/qc-bis-conversion-I.png

Bisulphite conversion II control strip

This control uses the Infinium II chemistry technology. If the bisulphite conversion went well, an adenine base is added, generating signal in the red channel. If there is some unconverted DNA, a guanine base is incorporated, resulting in signal in the green channel.

../_images/qc-bis-conversion-II.png
  • Specificity controls, which monitor potential non-specific primer extension.

Specificity I control strip is used to assess allele-specific extension for the Infinium I chemistry assays.

../_images/qc-specificity-I.png

Specificity II control strip allows you to estimate the specificity of extension for the Infinium II assay and to test whether any nonspecific methylation signal is detected over an unmethylated background.

../_images/qc-specificity-II.png

All the QC-plots shown on the application page may be downloaded in PDF format (see Minfi PDF Report).

Finally, based on the QC results, you can exclude particular samples as outliers and re-normalise the remaining assays together. To do so, click Sample list and select the samples that pass the QC check, then click the Remove outliers and re-normalise button.

../_images/QC-sample-list.png

Then, if you are happy with the quality of the re-normalised arrays, you can proceed to the next step: Differential Methylation Analysis.

The “Methylation array QC” application is based on the minfi and the shinyMethyl Bioconductor packages.

Test differential methylation

Action: to identify differential methylation at single CpG sites (differentially methylated positions, DMPs) across groups of normalized microarray assays using linear models. Currently, Illumina’s 450k and EPIC methylation arrays are supported.

The input data for this application is an Infinium Methylation Normalization file obtained with the “Infinium Methylation Normalization” application.

The analysis includes an annotation step, in which the application determines the genomic position of each methylated locus and its location relative to various genomic features. The application supports custom Methylation Array Annotations that you can upload with the Import Data application.

The application computes differential methylation statistics for each CpG site in the selected group compared to the average of the other groups. Alternatively, you can assess differential methylation for each group compared to a control group.

The application has the following options:

1. The “Group samples by” option allows you to group assays for comparison automatically: the application helps you group your samples according to an experimental factor indicated in the metainfo of the microarray assays, such as disease, tissue or treatment. (default: None)

2. The Control group option allows you to designate one of the created groups as a control. In this case, the application performs differential methylation analysis for each CpG site in each group against the control one. (default: No control group)

../_images/diff-meth-options.png

If you specify one or more confounding factors, you can identify differentially methylated sites between tested groups of samples, while taking into account potential confounders, such as sex, age, laboratory, etc. In this case the detected methylation changes are only caused by the factor of interest, for example treatment, while any possible effects of confounding factors are excluded. As confounding factors must be chosen according to metainfo keys common to all samples, remember to specify the relevant information for all the samples.

Explore the output with interactive Methylation Navigator.

The application is based on the minfi and limma Bioconductor packages.

Test differential regions methylation

Action: to determine and analyse contiguous regions which are differentially methylated across groups of normalized microarray assays. Currently, Illumina’s 450k and EPIC methylation arrays are supported.

As input, the application takes an “Infinium Methylation Normalization” file with normalised microarray assays and returns a Differential Methylation Statistics file that you can further explore with the Methylation Navigator. The application supports custom methylation chip annotations that you can upload with the Import Data application.

The application has the following options:

1. The “Group samples by” option allows you to automatically group assays according to an experimental factor indicated in the metainfo of the selected microarray assays, such as disease, tissue or treatment. (default: None)

2. The Control group option allows you to designate one of the created groups as a control. In this case, the application performs differential methylation analysis for each region in each group against the control one. (default: No control group)

../_images/diff-meth-options.png

If you specify one or more confounding factors, you can identify differentially methylated regions between tested groups of samples, while taking into account potential confounders, such as sex, age, laboratory, etc. In this case the detected methylation changes are only caused by the factor of interest, for example treatment, while any possible effects of confounding factors are excluded. As confounding factors must be chosen according to metainfo keys common to all samples, remember to specify the relevant information for all the samples.

The Test Differential Regions Methylation application is based on the minfi and DMRcate packages.

Explore the output with interactive Methylation Navigator.

Methylation navigator for sites

Action: to view, sort and filter the results of differentially methylated position (DMP) analysis.

../_images/MN-sites.png

The Methylation Navigator page contains four sections:

  1. The Groups Information section summarises the information on the created groups of samples to be tested.

2. The Top Differentially Methylated Sites table lists all the detected sites that are differentially methylated in the selected group compared to either the average of the other groups or a control group (if one is set).

../_images/MN-top-sites.png

For each DMP (differentially methylated position) or DMR (differentially methylated region), its Delta Beta, Average Beta, P-value, and FDR are shown.

Click a probe ID to get more information about the probe:

../_images/MN-sites-annotation.png

You can reduce the list of DMPs by filtering the data in the table based on the following criteria:

  • Max FDR (maximum acceptable false discovery rate) — only shows sites with FDR below the set threshold.
  • Methylation All/ Down/ Up — to show all sites or just those that are hypo- or hypermethylated.
  • Min Delta Beta — delta Beta represents the difference between the Beta values in the groups being compared; this filter can be used to get only sites with absolute Delta Beta value of at least this threshold.
  • Min Average Beta — only shows sites with average Beta value of at least this threshold.
../_images/MN-sites-filter.png

Sort the list of probes by clicking the arrows next to the name of the statistical metrics in the table headers.

../_images/MN-sites-sort.png
  3. A boxplot of methylation levels

Each colour corresponds to an individual probe you selected; each circle represents an assay belonging to the tested group. Each boxplot represents the distribution of methylation levels in a given group. The y-axis shows Beta values, while the x-axis shows probe IDs.

../_images/MN-sites-boxplot.png

4. The bottom-right section contains a search box that allows you to explore the results for a particular probe. Start typing a probe ID and select the probe of interest in the drop-down list that appears.

../_images/MN-sites-search.png

You can further export either the complete table of differential methylation analysis for all the groups or the list of values for the specific comparison in TSV format. See Export Data (for all comparisons, as .tsv) and Download filtered data for current comparison as .tsv options, respectively.

../_images/MN-sites-export.png
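The exported file is plain tab-separated text and can be processed with any tabular tool. A sketch using Python's csv module; the column names below are a hypothetical excerpt, as the exact header depends on the export:

```python
import csv
import io

# Hypothetical excerpt of an exported table; real column names may differ.
tsv = (
    "Probe ID\tDelta Beta\tAverage Beta\tP-value\tFDR\n"
    "cg0001\t0.30\t0.55\t1e-5\t0.01\n"
    "cg0002\t-0.25\t0.45\t3e-4\t0.02\n"
)

# DictReader maps each row to {column name: value}.
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
significant = [r["Probe ID"] for r in rows if float(r["FDR"]) < 0.05]
print(significant)
```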

Methylation navigator for regions

Action: to view, sort and filter the results of analysis of differential methylation regions (DMRs).

../_images/MN-regions.png

The Methylation Navigator page contains the following sections:

  1. The Groups Information section summarises the groups of samples created for the comparison.
../_images/MN-regions-group-info.png

2. The Top Differentially Methylated Regions table shows all the detected regions that are differentially methylated in the selected group compared to either the average of the other groups or a control group (if it is set).

../_images/MN-top-regions.png

You can further reduce the list of identified DMRs by excluding regions that do not meet the set filtering criteria. The following filters can be applied:

  • Max FDR (maximum acceptable Stouffer-transformed false discovery rate) — the FDR estimates the expected fraction of false positives among the regions reported as differentially methylated; this filter only shows regions with FDR values below the set threshold. Learn more about the Stouffer method in the paper by Kim S.C. (2013).
  • Methylation (Down/All/Up) — show all regions, or only hypo- or hypermethylated ones.
  • Min BetaFC (minimum mean Beta fold change within the region) — each probe within a region has a Beta value representing its relative methylation level (B1, B2, etc.), and BetaFC is the mean Beta fold change across these probes. Apply this filter to show only regions with an absolute BetaFC of at least the set threshold.
  • Min significant CpG sites count — minimum number of significant CpG sites inside the genomic region.
../_images/MN-regions-filters.png
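The Stouffer combination mentioned above merges per-CpG-site p-values into a single region-level value by converting each p-value to a Z-score and averaging. The sketch below is a generic illustration of the unweighted method, not Genestack's exact implementation:

```python
from math import sqrt
from statistics import NormalDist

def stouffer_p(pvalues):
    """Combine per-site p-values into one region-level p-value
    using the (unweighted) Stouffer Z-score method."""
    nd = NormalDist()
    # Convert each p-value to a Z-score, sum, and rescale by sqrt(k).
    z = sum(nd.inv_cdf(1 - p) for p in pvalues) / sqrt(len(pvalues))
    return 1 - nd.cdf(z)

# Several consistently small site-level p-values yield an
# even smaller combined region-level p-value.
combined = stouffer_p([0.01, 0.02, 0.03])
print(f"{combined:.6f}")
```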

You can also sort the list of identified DMRs by clicking the arrows next to the name of the statistical metrics in the table.

../_images/MN-regions-sort.png

Finally, you can export both the complete table of top differentially methylated regions for all groups (Export Data (for all comparisons, as .tsv)) and the list of regions with associated statistics for the current comparison (Download filtered data for current comparison as .tsv), both in TSV format.

../_images/MN-sites-export.png