Get started with Genestack
In this tutorial, we would like to introduce you to the core features of Genestack Platform. You will learn how our system deals with files, how it helps you organise and manage your data and how to share data with your colleagues. You will see how easy it is to work on private and public data simultaneously and seamlessly, and how to reproduce complex analyses with data flows, a built-in mechanism for capturing and replaying your research.
How to find relevant data?
From the Dashboard you can go to the Data Browser, the application that makes searching for relevant biological data fast and effective. Although you may have no private files at this point, you can access the public data available on the platform.
Currently, we have a comprehensive collection of publicly available datasets imported from GEO NCBI, ENA, SRA and ArrayExpress. We also provide other data that is useful in bioinformatic analysis, such as reference genomes and annotations.
One of the key features of Genestack is that all files are format-free objects: raw reads, mapped sequence, statistics, genome annotations, genomic data, codon tables, and so forth. All files have rich metadata, different for each file type.
Let us take a look at an example. Apply the “Reference genomes” filter to see the pre-loaded reference genomes. There is no single, commonly accepted file format for storing and exchanging genomic sequences and features: the sequence can be stored in FASTA, EMBL or GenBank format, and genomic features (introns, exons, etc.) can be represented in, for example, GFF or GTF files. Each of these formats has its own flavours and versions, which occasionally suffer from incompatibilities and conflicts.
In Genestack you no longer have to worry or even know about file formats. A Reference Genome file contains the packed sequence and genomic features. When data such as a reference genome is imported into Genestack (and several different formats can be imported), it is “packed” into a Genestack file, so all reference genomes behave identically, regardless of any differences in the physical formats underneath. You can browse reference genomes with our Genome Browser, use them to map raw sequencing reads and to analyse variations, and add and manage rich annotations such as Gene Ontology, without ever having to think about formats again.
Click a dataset name to view the associated metainformation with the Edit Metainfo app. Some metadata fields are filled in by our curators, some are available for you to edit with the Edit Metainfo app, and some are computed when files are initialised.
You can also explore the metadata of any file from anywhere on the platform using the View metainfo option in the context menu.
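To make the idea of “packing” more concrete, here is a minimal sketch in Python. It is not Genestack code: the ReferenceGenome class and the pack_fasta_and_gff function are hypothetical names invented for illustration. It assumes a plain FASTA sequence and a tab-separated GFF feature table, and shows how two physical formats could be folded into one object that downstream tools treat uniformly.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceGenome:
    """Hypothetical format-independent container (not a Genestack class)."""
    sequences: dict = field(default_factory=dict)  # contig name -> sequence
    features: list = field(default_factory=list)   # (contig, type, start, end)


def pack_fasta_and_gff(fasta_text, gff_text):
    """Fold a FASTA sequence and GFF features into one ReferenceGenome."""
    genome = ReferenceGenome()
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the contig name is the first word after ">".
            name = line[1:].split()[0]
            genome.sequences[name] = ""
        elif name is not None:
            genome.sequences[name] += line
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF directives/comments
        cols = line.split("\t")
        # GFF columns: seqid, source, type, start, end, ...
        genome.features.append((cols[0], cols[2], int(cols[3]), int(cols[4])))
    return genome


fasta = ">chr1 test contig\nACGT\nACGT\n"
gff = "##gff-version 3\nchr1\tdemo\texon\t1\t4\t.\t+\t.\tID=exon1\n"
genome = pack_fasta_and_gff(fasta, gff)
print(genome.sequences, genome.features)
```

A second loader for, say, GenBank input could return the same ReferenceGenome type, which is the point of the design: code that consumes the object never needs to know which format it came from.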
How to import data?
There are various options for importing your data. You can drag and drop or select files from your computer, import data from a URL or use previous uploads.
After data is uploaded and imported, the platform automatically recognises file formats and transforms the files into biological data types such as raw reads, mapped reads, reference genomes and so on. This means you will not have to worry about formats at all, which will most likely save you a lot of time. If files are not recognised, you can manually assign them to a specific data type by drag and drop.
At the next step, Edit metainfo, you can describe the uploaded data. Using an Excel-like spreadsheet, you can edit the file metainfo and add new attributes, for example cell type or age.
Another option for importing your data is to use import templates, which allow you to specify required and optional metainfo attributes for different file kinds. To create one, use the Add import template option on the Dashboard: when you scroll down to the bottom of the page, you will see an Add import template button.
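As a rough illustration of what format recognition involves, the following Python sketch maps file names to biological data types. The extension table and the guess_data_type function are assumptions made for this example only; the platform's actual recognition rules are not described in this tutorial and are likely to inspect file contents as well.

```python
import os

# Illustrative mapping only (an assumption, not the platform's real rules).
EXTENSION_TO_TYPE = {
    ".fastq": "raw reads",
    ".fq": "raw reads",
    ".bam": "mapped reads",
    ".fasta": "reference genome",
    ".fa": "reference genome",
}
COMPRESSION_SUFFIXES = (".gz", ".bz2")


def guess_data_type(filename):
    """Return a data type for a file name, or 'unrecognised'."""
    name = filename.lower()
    # Strip one compression suffix so "reads.fastq.gz" is treated as FASTQ.
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    extension = os.path.splitext(name)[1]
    return EXTENSION_TO_TYPE.get(extension, "unrecognised")


for uploaded in ["sample_1.fastq.gz", "aln.BAM", "notes.txt"]:
    print(uploaded, "->", guess_data_type(uploaded))
```

Files that fall through to "unrecognised" correspond to the case where you assign the data type manually.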
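The idea of required and optional attributes can be sketched in a few lines of Python. This is only a conceptual model: the dictionary layout and the check_metainfo function are hypothetical and do not reflect how Genestack stores import templates.

```python
def check_metainfo(template, metainfo):
    """Compare a file's metainfo with a template.

    Returns (missing, unexpected): required attributes that are absent or
    empty, and attributes the template does not mention at all.
    """
    required = template["required"]
    allowed = set(required) | set(template["optional"])
    missing = [key for key in required if not metainfo.get(key)]
    unexpected = sorted(key for key in metainfo if key not in allowed)
    return missing, unexpected


# Hypothetical template and file description, for illustration only.
template = {"required": ["organism", "cell type"], "optional": ["age"]}
metainfo = {"organism": "Homo sapiens", "batch": "7"}

missing, unexpected = check_metainfo(template, metainfo)
print("missing:", missing, "unexpected:", unexpected)
```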
How to build and run a pipeline?
All files on Genestack are created by various applications. When an application creates a new file, it specifies what should happen when the file is initialised: a script, a download, indexing or computation. In practice, this means that uninitialised files are cheap and quick to create, can be configured and used as inputs to applications to create other files, and can then be computed later, all at once. Let’s look at an example. Go to the public experiment library and choose the “Whole genome sequencing of human (russian male)” dataset.
Click the Analyse button, then select Trim Adaptors and Contaminants from the list of suggested applications. If you want to analyse only some of the files from a dataset, select the files you are interested in and choose Make a subset rather than using the entire dataset.
Regardless of the input you would like to start with, you do not have to start initialisation right away at this step. In fact, you can use the file created by the app as an input to other applications and continue building the pipeline. Notice that you can edit the parameters of the analysis on the app page. You can change them because the file is not yet initialised, i.e. the computation (in this case, trimming) has not yet started. Once initialisation has completed, these parameters are fixed. Because the parameters are saved in the metainfo, they can later be used to reproduce your work identically.
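The behaviour described above (files that are cheap to create, editable until initialised, and fixed afterwards) can be modelled with a small Python class. LazyFile and its methods are hypothetical names used only for this sketch; they are not part of any Genestack API, and the "trimming" here is just string slicing standing in for a real computation.

```python
class LazyFile:
    """Toy model of a file whose content is computed only on initialisation."""

    def __init__(self, name, compute, params=None, inputs=()):
        self.name = name
        self._compute = compute          # function(upstream_results, params)
        self.params = dict(params or {})
        self.inputs = list(inputs)
        self.initialised = False
        self.result = None

    def set_param(self, key, value):
        # Parameters may only change before the computation has run.
        if self.initialised:
            raise RuntimeError(f"{self.name} is initialised; parameters are fixed")
        self.params[key] = value

    def initialise(self):
        # Initialise upstream files first, then compute this file once.
        if not self.initialised:
            upstream = [f.initialise() for f in self.inputs]
            self.result = self._compute(upstream, self.params)
            self.initialised = True
        return self.result


raw = LazyFile("raw reads", lambda upstream, params: "ACGTNNNN")
trimmed = LazyFile(
    "trimmed reads",
    lambda upstream, params: upstream[0][: params["keep"]],
    params={"keep": 4},
    inputs=[raw],
)

trimmed.set_param("keep", 6)        # allowed: nothing has been computed yet
print(trimmed.initialise())         # computes raw first, then trimmed

try:
    trimmed.set_param("keep", 2)    # rejected: the file is now initialised
except RuntimeError as error:
    print(error)
```

Keeping the parameters alongside the file, as the params dictionary does here, mirrors the role the metainfo plays in making an analysis repeatable.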
To start initialisation of a newly created file, click on the name of the file and select Start initialisation.
To use this file as an input for a different application, for example to map the trimmed raw reads to a reference genome, you should click on Add step and select the “Spliced Mapping with Tophat2” application.
As a result, another dataset called “Spliced Mapping with Tophat2” is created and is waiting to be initialised. On the application page you can check whether the system suggested the correct reference genome and, if not, select the correct one.
This dataset, in turn, can be used as an input for a different application. As the last step of the analysis you could, for example, identify genetic variants by adding the “Variant Calling” app. To see the entire data flow we have just created, click on the name of the last created file, go to Manage and select File Provenance.
It will show you the processes that have been completed and the ones that still need to be initialised. To initialise only one of the steps, click on the corresponding cell, then on Actions, and select Start initialization. To initialise all of the uninitialised dependencies, simply press Start initialisation at the top.
You can track the progress of your computations using the Task Manager, which can be found at the top of the page.
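Initialising all uninitialised dependencies amounts to walking the provenance graph so that every file is computed after its inputs. The Python sketch below shows one way to derive such an order with the standard library's graphlib module (Python 3.9 or later). The file names and the initialisation_order function are illustrative assumptions, not a description of how the platform schedules tasks.

```python
from graphlib import TopologicalSorter


def initialisation_order(depends_on, initialised):
    """List uninitialised files so that each comes after its inputs.

    depends_on maps a file name to the set of files it is computed from.
    """
    order = TopologicalSorter(depends_on).static_order()
    return [name for name in order if name not in initialised]


# Hypothetical provenance for the pipeline built in this section.
provenance = {
    "trimmed reads": {"raw reads"},
    "mapped reads": {"trimmed reads", "reference genome"},
    "variants": {"mapped reads", "reference genome"},
}
already_initialised = {"raw reads", "reference genome"}

order = initialisation_order(provenance, already_initialised)
print(order)
```

Initialising a single step corresponds to picking one entry from this list, which only succeeds once everything before it has been computed.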
How to reproduce your work?
Now, let’s talk about reproducibility. We will show you how to take any data file in Genestack Platform, and repeat the analysis steps that led up to it on different data.
Let’s go back to the genetic variations file you created, called “Variant calling”. To find analysis results, you can go to Recent Results on the Dashboard, find the dataset in the Data Browser or go to the “My datasets” folder in the File Manager. You can also find it in the tutorial folder. Rather than viewing its provenance as we did before, let’s see if we can reuse the provenance. To do this, select the file, go to Manage and select Create new Data Flow.
On the next screen you will see the data flow we have previously created.
The data flow editor has one core goal: to help you create more files using this diagram. To do this, you will need to make some decisions for the boxes in the diagram via the Action menu. If you want to select different files, go to Choose another file. If you want to keep the original file, simply do not change anything.
In this example, we will use this data flow to produce variant calls for another raw sequence data file, FS02, reproducing the entire workflow including trimming low-quality bases, spliced mapping and variant calling. All you need to do is choose another input file and click on the Run dataflow button at the top of the page. You will be given a choice: you can initialise the entire data flow now or delay initialisation.
If you decide to delay initialisation until later, you will be brought back to the Data Flow Runner page, where you can initialise individual files by clicking on the file name and then selecting Start initialization.
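Conceptually, running a data flow on new data means replaying the recorded sequence of applications and parameters with a different source file. The following Python sketch models that idea; the replay function, the dictionary fields and the parameter placeholder are all hypothetical and are not taken from the platform.

```python
def replay(steps, source_name, initialise_now=False):
    """Create one new file description per recorded step, chained in order."""
    created = []
    current = source_name
    for app, params in steps:
        new_file = {
            "name": f"{app} ({source_name})",
            "app": app,
            "params": dict(params),   # same parameters as the original run
            "input": current,
            "initialised": initialise_now,
        }
        created.append(new_file)
        current = new_file["name"]    # the next step consumes this file
    return created


# Steps as captured from the provenance of the original analysis.
# The parameter dictionary is a placeholder, not a real application setting.
steps = [
    ("Trim Adaptors and Contaminants", {}),
    ("Spliced Mapping with Tophat2", {"reference": "chosen reference genome"}),
    ("Variant Calling", {}),
]

files = replay(steps, "FS02")         # delay initialisation
for f in files:
    print(f["input"], "->", f["name"], "| initialised:", f["initialised"])
```

Passing initialise_now=True corresponds to initialising the entire data flow immediately, while the default mirrors the choice to delay initialisation.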