Microbiome data analysis

Previously, to study microbes — bacteria, archaea, protists, algae, fungi and even micro-animals — it was necessary to grow them in a laboratory. However, many of the microorganisms living in complex environments (e.g. the gut or saliva) have proven difficult or even impossible to grow in culture. Recently developed culture-independent methods, based on high-throughput sequencing of 16S/18S ribosomal RNA gene variable regions and internal transcribed spacers (ITS), enable researchers to identify all the microbes in their complex habitats; in other words, to analyse a microbiome.

The purpose of this tutorial is to describe the steps required to perform microbiome data analysis, explain how to build your own data flow, and finally, discuss the results obtained from such an analysis.

Setting up a microbiome experiment

First, we need a good example of microbiome data. You can upload your own data using the Import button or search through all the available experiments with the Data Browser application. Our analysis will be based on data from Alfano et al. 2015. Let’s look up this experiment in the platform and open it in the Metainfo Editor:


The researchers examined the microbiome composition of different body regions of two captive koalas: the eye, the mouth and the gut (through both rectal swabs and faeces). Samples were characterised by 16S rRNA high-throughput Illumina sequencing. First, since koalas frequently suffer from ocular diseases caused by Chlamydia infection, the scientists examined the eye microbiome and found that it is very diverse, similar to other mammalian ocular microbiomes, but with a high representation of bacteria from the Phyllobacteriaceae family. Second, the authors determined that despite a highly specialised diet consisting almost exclusively of Eucalyptus leaves, koala oral and gut microbial communities are similar in composition to the microbiomes of the same body regions in other mammals. They also found that the rectal samples contain all of the microbial diversity present in the faecal ones. Finally, the researchers showed that the faecal communities of the captive koalas are similar to those of wild koalas, suggesting that captivity may not influence koala microbial health.

Microbiome data analysis pipeline

A typical data flow for microbiome analysis consists of the following steps:

  1. Quality control of raw reads
  2. Preprocessing of raw reads
  3. Quality control of preprocessed reads
  4. Microbiome analysis

Let’s go through each step to get a better idea of what it really means.

Quality control of raw reads

Garbage in — garbage out: your analysis is only as good as your data. Therefore, the first and very important step in any kind of analysis is quality control (QC). It allows you to look at relevant properties of the raw reads, such as quality scores, GC content and base distribution, and to check whether any low-quality reads, PCR primers, adaptors, duplicates or other contaminants are present in the samples. In order to assess the quality of the data, we’ll run the Raw Reads QC data flow.

The Genestack FastQC application generates basic statistics and many useful diagnostic plots. Here are some of them for the Mouth of Pci_SN265 sample.


Basic statistics reports sample composition metrics such as read type, number of reads, GC content and total sequence length. Our sample contains 470,459 paired-end reads, which together give us 236,170,418 bp of sequence. The GC content is 49%.
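These numbers are straightforward to derive directly from a FASTQ file. Here is a minimal sketch of the computation (the `fastq_stats` helper and the toy records are illustrative, not part of the platform):

```python
# Sketch: compute FastQC-style basic statistics from FASTQ records.
# In a FASTQ file, every record spans 4 lines; the sequence is line 2.

def fastq_stats(lines):
    """Return (read_count, total_length, gc_percent) from FASTQ lines."""
    n_reads = 0
    total_len = 0
    gc = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # sequence lines sit at offset 1 within each record
            seq = line.strip().upper()
            n_reads += 1
            total_len += len(seq)
            gc += seq.count("G") + seq.count("C")
    return n_reads, total_len, round(100 * gc / total_len)

# Two toy 4 bp reads: 6 of 8 bases are G/C, so GC content is 75%
records = ["@r1", "ATGC", "+", "IIII", "@r2", "GGCC", "+", "IIII"]
print(fastq_stats(records))  # (2, 8, 75)
```

For paired-end data like ours, the same computation runs over both mate files and the counts are summed.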


The Sequence length distribution module gives information about read length in a sample. In our example, all the reads have the same length of 251 bp.


The Per sequence GC content graph shows the GC distribution over all reads. A roughly normal distribution indicates a normal random library.


However, sharp peaks, as in our case, usually indicate the presence of adaptor, primer or rRNA contamination. To remove possible contaminants, we’ll run the Trim Adaptors and Contaminants application.

Per base sequence quality plots show the quality scores across all bases at each position in the reads. By default, only the low-quality zone and the mean quality line are displayed. If the median (the middle quartile, Q2) is less than 20, or the lower quartile (Q1) is less than 5, the metric fails (you’ll see a red cross next to it).


In our sample, the second mates in paired-end reads have bases of poor quality at the end of the sequences. To get rid of these bases and improve read quality, we’ll run the Trim Adaptors and Contaminants and Filter by Quality Scores applications.
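The per-position failure rule described above can be sketched in a few lines. This is an illustrative toy (not FastQC’s actual implementation), assuming Phred+33-encoded quality strings:

```python
# Sketch of the per-base quality rule: at each read position, take the
# lower quartile (Q1) and median (Q2) of the Phred scores across reads
# and flag the position as failing when Q2 < 20 or Q1 < 5.

def per_base_failures(quality_strings):
    """Return 0-based positions that fail the quality rule."""
    failures = []
    length = min(len(q) for q in quality_strings)
    for pos in range(length):
        scores = sorted(ord(q[pos]) - 33 for q in quality_strings)
        q1 = scores[len(scores) // 4]   # approximate lower quartile
        q2 = scores[len(scores) // 2]   # approximate median
        if q2 < 20 or q1 < 5:
            failures.append(pos)
    return failures

# 'I' encodes Q40 (good); '#' encodes Q2 (bad) -> only the last position fails
print(per_base_failures(["III#", "III#", "III#", "II##"]))  # [3]
```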

Per sequence quality scores reports show the frequencies of quality values in a sample. The reads are of good quality if the peak on the plot is shifted to the right, towards the maximum quality score.


In our example, the first and second mate reads differ in quality score, but almost all of them are still of good quality (>30). We will further improve read quality by running the Filter by Quality Scores application.

Per base sequence content plots show nucleotide frequencies at each base position in the reads. In a random library, there should be little difference between the A, T, C and G frequencies, and the lines representing them should run almost parallel to each other. The black N line indicates the content of unknown (N) bases, which shouldn’t be present in the library.


Since we are analysing 16S microbiome data, all the reads should begin with the same forward primer, which is why we observe peaks of 100% base content at the first positions. The nucleotide frequency pattern in the rest of the sequence can be explained by 16S sequence variation in the analysed bacterial community.

Sequence duplication levels plots help us evaluate library enrichment, i.e. the percentage of the library made up of sequences at each duplication level.


The application also picks up overrepresented sequences, which may represent primer or adaptor contamination, or indicate highly expressed genes.


The last two QC metrics — Sequence duplication levels and Overrepresented sequences — should not be used to evaluate 16S microbiome samples. Since we are looking at sequencing data for only a single gene, we are expecting to see an excess of highly similar sequences, and in turn, to get failures for these modules.
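For intuition, duplication levels can be tallied by counting how many times each distinct sequence occurs and then counting how many distinct sequences sit at each occurrence count. A minimal sketch with toy sequences (not platform code):

```python
# Sketch: tally per-sequence duplication levels, the basis of the
# duplication-level plot. In 16S data, high duplication is expected.
from collections import Counter

def duplication_levels(seqs):
    """Map duplication level -> number of distinct sequences at that level."""
    per_sequence = Counter(seqs)               # sequence -> occurrence count
    return dict(Counter(per_sequence.values()))

# One sequence seen twice, one seen once, one seen three times
print(duplication_levels(["A", "A", "B", "C", "C", "C"]))
```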

We have run QC on all the data in the experiment and put the reports in the Raw reads QC reports for Alfano et al. (2015) folder, so that you can open them all in the Multiple QC Report application to analyse the results.


You can see that the total number of sequencing reads per sample is quite small, varying from 197,000 to 471,000 paired-end reads. Overall, more than 2.5 million paired-end sequencing reads were generated.

Preprocessing of raw reads

FastQC reports help you understand whether it is necessary to improve the quality of your data by applying trimming, filtering, adaptor clipping and other preprocessing steps. Here is the list of Genestack preprocessing applications available for raw reads:

  • Trim adaptors and contaminants;
  • Trim low quality bases;
  • Filter by quality scores;
  • Trim to fixed length;
  • Subsample reads;
  • Filter duplicated reads.

Once we have checked the quality of the raw reads, let’s start building the Microbiome data analysis pipeline. Our preprocessing procedure will include two steps — adaptor trimming and filtering out low-quality reads.

  1. Trim adaptors and contaminants

For this, click Analyse and select Trim Adaptors and Contaminants:


This brings you to the application page. At this step, the application scans the reads for adaptors, primers, N bases or any poor-quality nucleotides at the ends of reads and, based on a log-scaled threshold, performs clipping.


By default, the application uses an internal list of widely used PCR primers and adaptors that can be considered possible contaminants. The occurrence threshold before adaptor clipping is set to 0.0001: this is the minimum frequency at which an adaptor must be found before clipping is considered necessary.

After trimming, reads become shorter. To discard trimmed reads whose length falls below a specific value, set this value in the “Minimum length of the trimmed sequence (bp)” option. We will use the default length of 15 nucleotides.
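The two operations can be sketched as follows. This is an illustrative toy, not the application’s implementation; the reads are made up, and the adaptor shown is the common Illumina adaptor prefix:

```python
# Sketch: clip an adaptor wherever it occurs in a read, then discard
# trimmed reads shorter than the minimum length (15 bp by default).

def clip_adaptor(seq, adaptor):
    """Remove the adaptor and everything after it, if present."""
    i = seq.find(adaptor)
    return seq[:i] if i != -1 else seq

def trim_and_filter(seqs, adaptor, min_len=15):
    trimmed = [clip_adaptor(s, adaptor) for s in seqs]
    return [s for s in trimmed if len(s) >= min_len]

reads = ["ACGT" * 5 + "AGATCGGAAGAGC",   # 20 bp left after clipping -> kept
         "ACGTAGATCGGAAGAGC"]            # only 4 bp left -> discarded
print(trim_and_filter(reads, "AGATCGGAAGAGC"))  # ['ACGTACGTACGTACGTACGT']
```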

To get more information about the application and its parameters, you can click the application name at the top of the page and choose About application.


Trimmed reads for our 8 samples are stored in Trimmed raw reads for Alfano et al (2015) folder.

  2. Filter by quality scores

Let’s continue our preprocessing procedure and add the next step, “Filter by Quality Scores”:


This application will filter out the reads based on their quality scores. For this purpose, we need to specify “Minimum quality score” and “Percentage of bases to be above the minimum quality score”.


According to the paper, only reads in which at least 75% of bases have a quality score above 30 were kept for further analysis. Let’s use the same settings: 30 for the minimum quality score and 75% for the percentage threshold.
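The filtering rule amounts to a simple per-read predicate. A minimal sketch (illustrative only, assuming Phred+33-encoded quality strings):

```python
# Sketch: keep a read only when at least 75% of its bases reach the
# minimum quality score (Q30 here, matching the paper's settings).

def passes_quality(qual, min_q=30, min_fraction=0.75):
    scores = [ord(c) - 33 for c in qual]          # decode Phred+33
    good = sum(s >= min_q for s in scores)
    return good / len(scores) >= min_fraction

# 'I' encodes Q40, '5' encodes Q20
print(passes_quality("IIIIIII5"))  # 7 of 8 bases pass Q30 -> True
print(passes_quality("IIII5555"))  # only 4 of 8 pass Q30 -> False
```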

Learn more about how the application works in the About application section.

All 8 samples after this filtering are collected in Filtered trimmed raw reads for Alfano et al (2015) folder.

Quality control of preprocessed reads

After preprocessing steps, we will use the FastQC Report application again to compare the quality of raw and preprocessed reads and to make sure that we improved reads quality.

Here is the FastQC report for “Mouth of Pci_SN265” sample before preprocessing:


The Per base sequence quality plots depict low-quality bases at the ends of the second mate reads. After trimming and filtering, the overall quality of the reads has improved (see the Per base sequence quality and Per base sequence content modules). We also expect warnings for Sequence length distribution, since the length of the reads has changed during preprocessing.


You can find all FastQC reports for preprocessed reads in the QC reports of preprocessed raw reads for Alfano et al (2015) folder.

All in all, the quality of the reads has noticeably improved after preprocessing, and our samples are ready for downstream microbiome analysis.

Microbiome analysis

The last step in the microbiome analysis pipeline is the identification of microbial species and their abundances in the microbiome samples of the two captive koalas. To analyse the taxonomic composition of the microbial communities, the Microbiome analysis app requires a reference database of previously classified sequences, such as the Greengenes (16S rRNA gene) or UNITE (fungal rDNA ITS region sequences) databases. The application compares the identified OTUs (operational taxonomic units) to the OTUs in the reference database.

Microbiome analysis produces two reports: a Research report and a Clinical report. In this tutorial we will focus on the research microbiome analysis report generated for all the tested samples (Microbiome report for 8 files). To explore the clinical microbiome report, open a research report for an individual sample and click the View Clinical Report button.

The research report provides abundance plots representing microbiota composition and microbiological diversity metrics.

Basic statistics describing the tested samples, such as sample count, number of reads per sample and number of clustered reads per sample are calculated during the analysis process.


You can group samples by relevant metadata keys with the Group samples by option. Besides, you can apply filters to display absolute OTU counts, hide unidentified OTUs or hide partially identified OTUs.

The plot displays the relative abundance of OTUs at the highest taxonomic resolutions: genus (L6 tab) and species (L7 tab). You can change the resolution to the L2 level to see which phyla are the most abundant across the samples. For example, our results show that, at low taxonomic resolution (L2 tab), the composition of microbial communities is similar between samples. Bacteroidetes (8.30–86.73%), Firmicutes (1.46–50.49%) and Proteobacteria (1.38–64.96%) are the most abundant phyla across most of the samples, followed by Actinobacteria (0.02–15.14%) and Fusobacteria (0–10.33%).
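For intuition, here is a minimal sketch of how an OTU count table can be collapsed to the phylum (L2) level and converted to relative abundances. The taxa, OTU labels and counts are toy data, not values from this experiment:

```python
# Sketch: collapse an OTU count table to phylum level and express
# each phylum's total as a percentage of all reads in the sample.
from collections import defaultdict

def phylum_abundance(counts):
    """counts: {(phylum, otu_id): reads} -> {phylum: percent}."""
    totals = defaultdict(int)
    for (phylum, _otu), n in counts.items():
        totals[phylum] += n
    grand_total = sum(totals.values())
    return {p: round(100 * n / grand_total, 2) for p, n in totals.items()}

counts = {("Bacteroidetes", "otu1"): 60,
          ("Bacteroidetes", "otu2"): 20,
          ("Firmicutes", "otu3"): 20}
print(phylum_abundance(counts))  # {'Bacteroidetes': 80.0, 'Firmicutes': 20.0}
```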


You can see these results in the table as well. Click on the Total header in the table to get the most abundant phyla across the samples:


Our findings are consistent with the paper results:


At a higher taxonomic resolution (L7 tab), more than 820 microorganisms were identified. The koala eye and the koala gastrointestinal tract are characterised by distinct microbial communities. The eye microbial community is very diverse: despite a small number of very abundant phyla, the eye microbiome is characterised by a high representation of bacteria from the family Phyllobacteriaceae (34.93%).


To measure the similarity between the bacterial communities we used principal component analysis (PCA) based on Pearson correlation coefficients:


You may change the PCA type in the upper-left corner of the plot and try other metrics to quantify the compositional dissimilarity between samples: bray_curtis, abund_jaccard, euclidean, binary_pearson and binary_jaccard.
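As an illustration of one of these metrics, the Bray-Curtis dissimilarity between the OTU count vectors u and v of two samples is 1 − 2·Σmin(uᵢ, vᵢ) / (Σuᵢ + Σvᵢ). A minimal sketch with toy counts:

```python
# Sketch: Bray-Curtis dissimilarity between two samples' OTU counts.
# 0 means identical composition; 1 means no shared OTUs at all.

def bray_curtis(u, v):
    shared = sum(min(a, b) for a, b in zip(u, v))
    return 1 - 2 * shared / (sum(u) + sum(v))

print(bray_curtis([10, 0, 5], [10, 5, 0]))   # 0.333..., partly shared
print(bray_curtis([10, 0, 5], [10, 0, 5]))   # 0.0, identical samples
```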

Note, however, that in the paper the authors used principal coordinate analysis (PCoA) to show the similarity between the koala microbial communities:


Both PCA and PCoA are used to visualise the data, but they apply different mathematical approaches to it.


What is the difference between PCA and PCoA?

The purpose of PCA is to represent as much of the variation as possible in the first few axes. To do this, the variables are first centred to have a mean of zero, and then the axes are rotated (and re-scaled). In the end, the first axis contains as much variation as possible, the second contains as much of the remaining variation as possible, and so on.
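The centre-and-rotate procedure can be sketched with an SVD. This is an illustrative toy, not the application’s implementation; the sample matrix is made up:

```python
# Sketch of PCA: centre each variable at zero, then project onto the
# directions of maximal variance (the right singular vectors of the
# centred matrix).
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the first n_components principal axes."""
    Xc = X - X.mean(axis=0)                             # centre variables
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)   # rotation axes
    return Xc @ vt[:n_components].T                     # scores per sample

# Toy data: almost all variation lies along the first variable
samples = np.array([[0.0, 0.0], [1.0, 0.2], [2.0, -0.2], [3.0, 0.0]])
scores = pca(samples)   # first column captures most of the variation
```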

PCoA uses a different approach, one based on the distances between data points. First, PCoA projects the distances into Euclidean space in a larger number of dimensions (we need n-1 dimensions for n data points). It puts the first point at the origin, the second along the first axis, then adds the third so that its distances to the first two are correct (which means adding a second axis), and so on until all the points are added. To get back to two dimensions, we apply PCA to the newly constructed points, capturing the largest amount of variation from the (n-1)-dimensional space.
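The construction above is classical (metric) multidimensional scaling, and the usual computational shortcut is to double-centre the squared distance matrix and take its top eigenvectors. A minimal sketch (illustrative only, with a made-up point configuration):

```python
# Sketch of classical PCoA: double-centre the squared distance matrix
# to recover an inner-product matrix, then eigendecompose it.
import numpy as np

def pcoa(D, n_components=2):
    """Recover coordinates from a pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred inner products
    w, v = np.linalg.eigh(B)                 # eigenvalues, ascending
    order = np.argsort(w)[::-1]              # largest eigenvalues first
    w, v = w[order], v[:, order]
    return v[:, :n_components] * np.sqrt(np.maximum(w[:n_components], 0.0))

# For Euclidean input distances, PCoA recovers the configuration
# (up to rotation/reflection), so pairwise distances are preserved.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(points[:, None] - points[None], axis=-1)
coords = pcoa(D)
```

For non-Euclidean dissimilarities such as Bray-Curtis, some eigenvalues can be negative; clamping them to zero, as above, is one common convention.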

Congratulations! You’ve just gone through the entire tutorial!