Get started with Genestack

In this tutorial, we would like to introduce you to the core features of Genestack Platform. You will learn how our system deals with files, how it helps you organise and manage your data and how to share data with your colleagues. You will see how easy it is to work on private and public data simultaneously and seamlessly, and how to reproduce complex analyses with data flows, a built-in mechanism for capturing and replaying your research.

How to find relevant data?

From the Dashboard you can go to the Data Browser, the application that makes searching for relevant biological data fast and effective. Although you may have no private files at this point, you can access the public data available on the platform.

Currently, we have a comprehensive collection of publicly available datasets imported from NCBI GEO, ENA, SRA and ArrayExpress. We also provide other data that is useful in bioinformatic analysis, such as reference genomes and annotations.

../../_images/data-browser1.png

One of the key features of Genestack is that all files are format-free objects: raw reads, mapped sequence, statistics, genome annotations, genomic data, codon tables, and so forth. All files have rich metadata, different for each file type.

Let us take a look at an example. Apply the “Reference genomes” filter to see pre-loaded reference genomes. There is no single, commonly accepted standard file format for storing and exchanging genomic sequences and features: the sequence can be stored in FASTA, EMBL or GenBank formats, while genomic features (introns, exons, etc.) can, for example, be represented in GFF or GTF files. Each of these formats has its own flavours and versions, which occasionally suffer from incompatibilities and conflicts.

In Genestack you no longer have to know or worry about file formats. A Reference Genome file contains the packed sequence and genomic features. When data such as a reference genome is imported into Genestack (and several different formats can be imported), it is “packed” into a Genestack file, meaning that all reference genomes behave identically, regardless of any differences in the physical formats underneath. You can browse reference genomes with our Genome Browser, use them to map raw sequencing reads and to analyse variations, and add and manage rich annotations such as Gene Ontology, without ever having to think about formats again.
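To make the idea of “packing” more concrete, here is a minimal Python sketch of a format-neutral container. It is not how Genestack stores reference genomes internally; the names `PackedGenome`, `Feature`, `add_fasta` and `add_gff` are hypothetical and exist only for this illustration, and the parsers handle only the simplest FASTA and GFF-style input.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    seqid: str
    kind: str      # e.g. "exon" or "gene"
    start: int     # 1-based, inclusive (GFF convention)
    end: int


@dataclass
class PackedGenome:
    """Hypothetical format-neutral container: sequences plus features."""
    sequences: dict = field(default_factory=dict)
    features: list = field(default_factory=list)


def add_fasta(genome, text):
    """Add sequences from FASTA text ('>' header lines, then sequence lines)."""
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            genome.sequences[name] = ""
        elif name is not None:
            genome.sequences[name] += line


def add_gff(genome, text):
    """Add features from GFF-style text (nine tab-separated columns)."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        genome.features.append(Feature(cols[0], cols[2], int(cols[3]), int(cols[4])))


genome = PackedGenome()
add_fasta(genome, ">chr1 toy contig\nACGTAC\nGTTA\n")
add_gff(genome, "chr1\tdemo\texon\t2\t5\t.\t+\t.\tID=exon1\n")

# Downstream code only works with PackedGenome, never with the source formats.
exon = genome.features[0]
print(genome.sequences[exon.seqid][exon.start - 1:exon.end])
```

Whatever the input format was, code that consumes the packed object looks the same, which is the behaviour described above.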

Click a dataset name to view the associated metainformation with the Edit Metainfo app. Some metadata fields are filled in by our curators, some are available for you to edit with the Edit Metainfo app, and some are computed when files are initialised.

../../_images/metainfo-editor1.png

You can also explore the metadata of any file, wherever you are in the platform, using the View metainfo option in the context menu.

../../_images/metainfo-reference-genome.png

How to import data?

Now let’s discuss importing data into the platform. On the Dashboard you can find an Import data option. Clicking it takes you to the Import Data app page.

../../_images/dashboard_import.png

There are various options for importing your data. You can drag and drop or select files from your computer, import data from a URL or use previous uploads.

../../_images/import_1.png

After the data is uploaded and imported, the platform automatically recognises file formats and transforms the files into biological data types such as raw reads, mapped reads, reference genomes and so on. This means you do not have to worry about formats at all, which will most likely save you a lot of time. If files are not recognised, you can manually assign them to a specific data type by drag and drop.

../../_images/import_2.png

In the next step, Edit metainfo, you can describe the uploaded data. Using an Excel-like spreadsheet, you can edit the file metainfo and add new attributes, for example cell type or age.

../../_images/import_3.png

Another way to import your data is to use import templates. Import templates allow you to specify required and optional metainfo attributes for different file kinds. You can find an Add import template option on the Dashboard. When you scroll down to the bottom of the page, you will also see an Add import template button.

../../_images/import-welcome-page1.png
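The idea of required and optional attributes can be sketched in a few lines of Python. This is only an illustration of the concept, not the platform's template format: `ImportTemplate` and `check_metainfo` are hypothetical names, and the attributes “organism” and “batch” are invented for the example (“cell type” and “age” come from the text above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportTemplate:
    """Hypothetical stand-in for an import template."""
    required: frozenset
    optional: frozenset = frozenset()


def check_metainfo(template, metainfo):
    """Return (missing required keys, keys the template does not mention)."""
    filled = {key for key, value in metainfo.items() if value not in (None, "")}
    missing = sorted(template.required - filled)
    unknown = sorted(set(metainfo) - template.required - template.optional)
    return missing, unknown


template = ImportTemplate(
    required=frozenset({"organism", "cell type"}),
    optional=frozenset({"age"}),
)
missing, unknown = check_metainfo(
    template, {"organism": "Homo sapiens", "age": "35", "batch": "B2"}
)
print("missing:", missing)
print("not in template:", unknown)
```

Here the file would be flagged because the required “cell type” attribute is empty, while “batch” is simply an extra attribute that the template does not describe.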

How to build and run a pipeline?

All files on Genestack are created by various applications. When an application creates a new file, it specifies what should happen when the file is initialised: a script, a download, indexing or a computation. In practice, this means that uninitialised files are cheap and quick to create, can be configured and used as inputs to applications to create other files, and can then be computed all at once later. Let’s look at an example. Go to the public experiment library and choose the “Whole genome sequencing of human (russian male)” dataset.

../../_images/wgs-russian-male-1.png

Click the Analyse button and then select Trim Adaptors and Contaminants in the list of suggested applications. If you want to analyse only some of the files from a given dataset, select the files you are interested in and use Make a subset rather than analysing the entire dataset.

../../_images/wgs-russian-male-2.png

Regardless of the input you would like to start with, you do not have to start initialisation right away at this step. In fact, you can use the file created by the app as an input to other applications and continue building the pipeline. Notice that you can edit the parameters of the analysis on the app page. You can change them because the file is not yet initialised, i.e. the computation (in this case, trimming) has not yet started. After initialisation has completed, these parameters are fixed. Because the parameters are saved in the metainfo, they can later be used to reproduce your work identically.

../../_images/trim-adaptors-app.png
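The behaviour described above (files are created first, configured freely, and computed only on initialisation, after which their parameters are fixed) can be modelled with a small Python sketch. It is a toy model under our own assumptions, not Genestack's implementation; `LazyFile`, `set_param` and the `trim_char` parameter are hypothetical.

```python
class LazyFile:
    """Hypothetical model of a file that is created first and computed later."""

    def __init__(self, name, compute, params=None, inputs=()):
        self.name = name
        self._compute = compute          # callable(params, input_values) -> value
        self._params = dict(params or {})
        self.inputs = list(inputs)       # other LazyFile objects
        self.initialised = False
        self.value = None

    @property
    def params(self):
        return dict(self._params)        # a copy, so callers cannot mutate in place

    def set_param(self, key, value):
        if self.initialised:
            raise RuntimeError(f"{self.name}: parameters are fixed after initialisation")
        self._params[key] = value

    def initialise(self):
        if self.initialised:
            return self.value
        # Inputs are initialised first, so a whole chain can be computed at once.
        input_values = [f.initialise() for f in self.inputs]
        self.value = self._compute(self._params, input_values)
        self.initialised = True
        return self.value


raw = LazyFile("raw reads", lambda p, _: ["ACGTNN", "TTGACA"])
trimmed = LazyFile(
    "trimmed reads",
    lambda p, ins: [read.rstrip(p["trim_char"]) for read in ins[0]],
    params={"trim_char": "N"},
    inputs=[raw],
)

trimmed.set_param("trim_char", "N")   # allowed: nothing has been computed yet
print(trimmed.initialise())           # computes the raw file first, then trims
try:
    trimmed.set_param("trim_char", "A")
except RuntimeError as err:
    print(err)
```

Because the parameters cannot change once the value exists, the stored parameters always describe exactly how the result was produced.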

To start initialisation of a newly created file, click on the name of the file and select Start initialisation.

../../_images/cla-start-initialization.png

To use this file as an input for a different application, for example to map the trimmed raw reads to a reference genome, you should click on Add step and select the “Spliced Mapping with Tophat2” application.

../../_images/cla-add-step1.png

As a result, another dataset called “Spliced Mapping with Tophat2” is created and is waiting to be initialised. On the application page you can check whether the system suggested the correct reference genome and, if not, select the correct one.

../../_images/tophat.png

This dataset, in turn, can be used as an input for a different application. As the last step of the analysis you could, for example, identify genetic variants by adding the “Variant Calling” app. To see the entire data flow we have just created, click on the name of the last created file, go to Manage and select File Provenance.

../../_images/provenance1.png

It will show you the processes that have been completed and the ones that still need to be initialised. To initialise only one of the steps, click on a given cell, then on Actions, and select Start initialization. To initialise all of the uninitialised dependencies, simply press Start initialisation at the top.

../../_images/provenance2.png
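Initialising “all of the uninitialised dependencies” amounts to walking the provenance graph and computing pending files in dependency order. The following Python sketch shows one way to work out that order with the standard library's `graphlib` module (Python 3.9+). The `provenance` dictionary and the file names in it are made up for the example, and this is not a description of how the platform schedules tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical provenance: each file maps to the files it was created from.
provenance = {
    "raw reads": [],
    "reference genome": [],
    "trimmed reads": ["raw reads"],
    "mapped reads": ["trimmed reads", "reference genome"],
    "variants": ["mapped reads", "reference genome"],
}
initialised = {"raw reads", "reference genome"}


def pending_in_order(target, provenance, initialised):
    """List the uninitialised files needed for `target`, dependencies first."""
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in needed or name in initialised:
            continue
        needed.add(name)
        stack.extend(provenance[name])
    # Keep only edges between pending files, then sort topologically.
    subgraph = {n: [d for d in provenance[n] if d in needed] for n in needed}
    return list(TopologicalSorter(subgraph).static_order())


print(pending_in_order("variants", provenance, initialised))
```

Files that are already initialised are skipped, so asking for the last file in the chain schedules only the steps that still need to run.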

You can track the progress of your computations using the Task Manager, which can be found at the top of the page.

How to reproduce your work?

Now, let’s talk about reproducibility. We will show you how to take any data file in Genestack Platform, and repeat the analysis steps that led up to it on different data.

Let’s go back to the genetic variations file you created, called “Variant calling”. To find analysis results, you can go to Recent Results on the Dashboard, find the dataset in the Data Browser or go to the “My datasets” folder in the File Manager. You can also find it in the tutorial folder. Rather than viewing its provenance as we did before, let’s see if we can reuse the provenance. To do this, select the file, go to Manage and select Create new Data Flow.

../../_images/create-new-data-flow1.png

On the next screen you will see the data flow we have previously created.

../../_images/run-data-flow.png

The data flow editor has one core goal: to help you create more files using this diagram. To do this, you will need to make some decisions for boxes in the diagram via the Action menu. If you want to select different files, go to Choose another file. If you want to keep the original file, simply do not change anything.

../../_images/choose-another-file.png

In this example, we will use this data flow to produce variant calls for another raw sequence data file, FS02, reproducing the entire workflow including trimming low-quality bases, spliced mapping and variant calling. All you need to do is choose another input file and click the Run dataflow button at the top of the page. You will be given a choice: you can initialize the entire data flow now or delay initialization.

../../_images/delay-initialization-until-later1.png
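Conceptually, a data flow is the recorded chain of applications and parameters, detached from the original input so that it can be replayed on new data. The Python sketch below illustrates this under our own assumptions: `Step`, `DataFlow` and the naming scheme for the created files are hypothetical, while the application names and the FS02 input are taken from this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    app: str
    params: tuple = ()   # (key, value) pairs, kept immutable


@dataclass(frozen=True)
class DataFlow:
    """Hypothetical data flow: an ordered chain of application steps."""
    steps: tuple

    def run(self, input_name):
        """Describe one new, still uninitialised file per step."""
        created, previous = [], input_name
        for step in self.steps:
            name = f"{step.app} [{input_name}]"
            created.append({
                "name": name,
                "input": previous,
                "params": dict(step.params),
                "initialised": False,   # initialisation can be delayed
            })
            previous = name
        return created


flow = DataFlow(steps=(
    Step("Trim Adaptors and Contaminants"),
    Step("Spliced Mapping with Tophat2"),
    Step("Variant Calling"),
))

for new_file in flow.run("FS02"):
    print(new_file["name"], "<-", new_file["input"])
```

Running the flow only creates the file descriptions; as in the platform, the actual computation can start immediately or be delayed until later.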

If you decide to delay the initialization until later, you will be brought back to the Data Flow Runner page, where you can initialize individual files by clicking on the file name and selecting Start initialization.

../../_images/start_init.png

How to share and manage your data?

To share data we use groups. A group is a shared project for two or more users. To manage existing groups or create a new one, click on the Genestack logo in the upper left corner and select Manage groups in the shortcuts menu.

../../_images/shortcuts_menu_manage_groups.png

On the Manage Groups page, if you have no groups so far, click the Create group button and create your first one.

../../_images/create-new-group.png

Right away we have a new group:

../../_images/my-new-group-members.png

And we can add a new member to this newly created group:

../../_images/add-user-to-the-group1.png

Now your group looks like this:

../../_images/first_group.png

No confirmation is needed: any user in your organisation can create a group and add other users from your organisation to it. You are the group administrator of any group that you create. As the group administrator, you can add users to your group, remove them from it, or change their permissions, e.g. make them administrators or make them “sharing” or “non-sharing” users.

All groups appear as folders under Shared with me in File Manager, and the moment you add a user to a group they will see the group’s folder in their File Manager.

../../_images/shared-with-me.png

Group folders are the same as all other folders in the system: you can add files to and remove files from group folders just as with any other folder. There is an important point to note though: adding a file to a group folder is not the same as sharing it with the group.

To share one or more files with a group, select them and click Share in the context menu. Some applications (e.g. Data Browser, Metainfo Editor) have a Share button as well. In the window that opens, choose the group you want to share the selected files with.

../../_images/share1.png

After that, click Share and use the Link option to specify whether you want to add the shared data to the group folder.

../../_images/link-shared-files.png

If you choose to link the shared files, all group members will see the shared data at the top level of the group folder. If you do not link the files, they will still be shared, but members of the group will not see them in the group folder, although they can access the data via search. You can always add shared files to group folders later.
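The difference between sharing (granting access) and linking (making a file visible in the group folder) can be summarised in a short Python model. This is a simplified sketch, not the platform's permission system; the `Group` class, its methods, and the file and user names are all hypothetical.

```python
class Group:
    """Hypothetical model separating access (sharing) from visibility (linking)."""

    def __init__(self, name, members=()):
        self.name = name
        self.members = set(members)
        self.shared = set()   # files that members are allowed to access
        self.folder = []      # files listed in the group folder

    def add_to_folder(self, file_name):
        """List a file in the group folder; this alone grants no access."""
        if file_name not in self.folder:
            self.folder.append(file_name)

    def share(self, file_name, link=True):
        """Grant access to members and optionally link into the group folder."""
        self.shared.add(file_name)
        if link:
            self.add_to_folder(file_name)

    def can_access(self, user, file_name):
        return user in self.members and file_name in self.shared

    def search(self, user, text):
        """Shared files are findable by members even when they are not linked."""
        if user not in self.members:
            return []
        return sorted(f for f in self.shared if text.lower() in f.lower())


group = Group("My new group", members={"alice"})
group.share("variants.vcf", link=False)   # shared, but not shown in the folder
group.add_to_folder("notes.txt")          # shown in the folder, but not shared

print(group.folder)
print(group.can_access("alice", "variants.vcf"))
print(group.can_access("alice", "notes.txt"))
print(group.search("alice", "variants"))
```

In this model, the unlinked file is accessible and searchable but absent from the folder, while the file that was only added to the folder is listed there without being shared.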

It is very easy to share data with users in the same organisation: you simply create a group and share files, and all group members see the shared data immediately.

What about sharing across organisations? Say you work in a hospital research group, have imported some valuable pathogenic specimen sequence data into Genestack Platform, and want to share it with colleagues in a pharma company who are working on novel drugs to kill the pathogen. It is easy to set up a new cross-organisational group or to turn an existing group into one. When you add new users, simply type in the email address of the user from the other organisation. Genestack Platform autocompletes only the names of users in your own organisation. This is a security feature: it means that no one from another organisation can find out who from your organisation is registered in Genestack Platform. After you enter the user’s email, you should create an invitation and send it to the other organisation:

../../_images/manage-groups-invite.png

Your organisation administrator will need to approve the invitation, and then the other organisation’s administrator will have to approve it, too. After the administrators of both organisations have confirmed the collaboration, the group becomes a cross-organisational group and other users can be added easily. The inviting organisation’s administrator will see the following on their group management screen:

../../_images/incoming-invitation.png

Once they confirm the outgoing invitation, the other organisation’s administrator will see the same in their Incoming invitations section and will have to confirm it as well. After both confirmations, the new group has members from both organisations:

../../_images/cross-org-group.png

Note that you can change the status of users from your organisation, but not of users from other organisations. A cross-organisational group can have multiple organisations participating in it. Adding each new participating organisation requires the approval of the administrators of all organisations already in the group, as well as that of an administrator from the organisation being invited. Once the approvals are in, sharing is easy: you can collaborate across organisational (enterprise) boundaries with appropriate administrative controls in place.
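The approval rule for cross-organisational groups can be expressed as a small Python sketch: an invitation is accepted only when every organisation already in the group and the invited organisation have approved it. The `CrossOrgInvitation` class and the organisation names are hypothetical, and the sketch does not enforce the order in which approvals arrive.

```python
class CrossOrgInvitation:
    """Hypothetical approval tracker for inviting an organisation to a group."""

    def __init__(self, group_orgs, invited_org):
        self.invited_org = invited_org
        # Every organisation already in the group, plus the invited one, must approve.
        self.required = set(group_orgs) | {invited_org}
        self.approved = set()

    def approve(self, admin_org):
        if admin_org not in self.required:
            raise ValueError(f"{admin_org} is not a party to this invitation")
        self.approved.add(admin_org)

    @property
    def accepted(self):
        return self.approved == self.required


invitation = CrossOrgInvitation(group_orgs={"Hospital"}, invited_org="Pharma")
invitation.approve("Hospital")
print(invitation.accepted)   # the invited organisation has not approved yet
invitation.approve("Pharma")
print(invitation.accepted)
```

With more organisations in the group, the same check simply waits for a larger set of approvals before the new organisation joins.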