FASTGenomics/hello_genomics_calc_py36

Name: hello_genomics_calc_py36

Owner: FASTGenomics

Description: Sample Python Calculation

Created: 2017-10-10 14:24:22.0

Updated: 2017-10-31 13:24:27.0

Pushed: 2018-01-15 15:48:20.0

Homepage:

Size: 121

Language: Python

GitHub Committers

| User | Most Recent Commit | # Commits |
|---|---|---|
| Philipp A. | 2018-01-10 12:28:02.0 | 2 |
| Henning Dickten | 2018-03-06 14:33:48.0 | 18 |
| Gerhard Schlemm | 2017-12-11 12:17:46.0 | 1 |

Other Committers

| User | Email | Most Recent Commit | # Commits |
|---|---|---|---|
| Christian Sauer | christian.sauer@comma-soft.com | 2017-12-11 12:25:24.0 | 5 |

README

o      O        o  o              .oOOOo.
O      o       O  O              .O     o                              o
o      O       o  o              o
OoOooOOo       O  O              O
o      O .oOo. o  o  .oOo.       O   .oOOo .oOo. 'OoOo. .oOo. `oOOoOO. O  .oOo  .oOo
O      o OooO' O  O  O   o       o.      O OooO'  o   O O   o  O  o  o o  O     `Ooo.
o      o O     o  o  o   O        O.    oO O      O   o o   O  o  O  O O  o         O
o      O `OoO' Oo Oo `OoO'         `OooO'  `OoO'  o   O `OoO'  O  o  o o' `OoO' `OoO'

How to write a FASTGenomics App

Writing an app is fairly easy: there are some conventions you need to know, but otherwise you are free to use any language, tools and methods you want. This document explains the basic structure of an app using the Python “Hello Genomics” app as an example, describes the workflow for getting an app published, and outlines the core concepts.

TL;DR
Core concepts

There are two flavors of apps in FASTGenomics: Calculations and Visualizations. Calculations perform data-intensive tasks such as clustering, whereas visualizations display the results of calculations. For example, a visualization might take a clustering result and display it as a diagram for the user.

The following sections give you an overview of how to write your own FASTGenomics app that can be used in analyses, and describe the ingredients you need:

Docker

Every application runs in the FASTGenomics runtime in its own Docker container (which you can think of as a self-contained, portable workplace). Docker containers help us eliminate “works on my machine” problems and provide full reproducibility and transparency. They also let you use any programming language and framework you want, and they make it simple for us to integrate your app into analyses if you want us to. You like Python? So do we. You are a Haskell or Julia expert? Just use it! Do you have a special configuration that is extremely complicated or annoying to install? Just do it once and your app will work everywhere.

You never heard of Docker before? Read the article Docker Overview.

These are the very small number of things you really need to know:

In order to build and test your container proceed as follows:

  1. Install Docker on your development machine (see Install Docker (CE))
  2. Write the Dockerfile and docker-compose.yml
  3. Build your container with docker-compose -f <docker-compose.filename.yml> build
  4. Provide sample input data (have a closer look at our example) and check paths in the docker-compose.yml. We recommend relative paths.
  5. Start the app via docker-compose -f <docker-compose.filename.yml> up

You already have a working Python script? Just clone hello-genomics, replace main.py with your script, rename the directory, and adjust the paths in the Dockerfile.

One more thing: Once you have started your application (container), you can list all current instances via docker ps -a. To inspect the output of an application, just type docker logs <container-id>.

App structure and manifest.json

Your application should be structured as follows:


docker-compose.yml (best practise)
Dockerfile (mandatory)
hello_genomics (mandatory: source code)
├── __init__.py
├── logging_config.py
└── main.py
manifest.json (mandatory)
LICENSE (mandatory)
README.md (mandatory)
requirements.txt (best practise)
sample_data (mandatory)
├── config
│   └── input_file_mapping.json
├── data
│   ├── dataset
│   │   ├── ...
│   │   └── considered_genes.tsv
│   └── other_app_uuid
│       └── output
├── output
└── summary
templates (optional)
└── summary.md.j2
test (best practise)

FASTGenomics assumes that:

Each app has to provide a manifest.json file with the following metadata-entries:

See the attached manifest.json for more information. To validate your directory structure and manifest.json, use check_my_app in the fastgenomics-py package.
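As a rough illustration of what a structure check involves, the sketch below looks only for the entries marked mandatory in the structure above. It is not the check_my_app implementation: the function name find_missing_entries and the MANDATORY_ENTRIES list are hypothetical, and the name of the source directory (hello_genomics in the example) is passed in because it differs per app.

```python
from pathlib import Path

# Entries marked "mandatory" in the app structure (the source directory is
# handled separately because its name differs from app to app).
MANDATORY_ENTRIES = ["Dockerfile", "manifest.json", "LICENSE", "README.md", "sample_data"]


def find_missing_entries(app_dir, source_dir_name):
    """Return the mandatory files/directories that are absent from app_dir."""
    root = Path(app_dir)
    expected = MANDATORY_ENTRIES + [source_dir_name]
    return [name for name in expected if not (root / name).exists()]
```

An empty list means every mandatory entry was found; the content of manifest.json is not inspected by this sketch.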

Being part of a workflow
┌────────────┐        ┌────────────┐        ┌────────────┐
│            │        │            │        │            │
│  app N-1   │ ─────▶ │  your app  │ ─────▶ │   app N+1  │
│  (UUID1)   │ a.txt  │  (UUID2)   │ b.txt  │   (UUID3)  │
│            │        │            │        │            │
└────────────┘        └────────────┘        └────────────┘

Your app is one piece of a larger puzzle: one of our goals is to enable you to create powerful analyses composed of small, interchangeable applications like yours. To achieve this, every app should be as universal as possible. Every app also has to declare its inputs and outputs so that we know which apps can be combined into a “workflow”.

Example: If you write a classification app, we would like to know the Type and intent (the Usage field in the manifest.json) of your input and output. This lets us avoid feeding your output into another app that expects unclustered data as input. In future releases we would like to unify these types and intents and offer an easy-to-use, “Lego”-like interface for your app.
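To make the idea of combinable apps concrete, here is a minimal sketch that pairs the outputs of one app with the inputs of the next whenever their Type fields agree. How the FASTGenomics runtime actually matches apps is not described in this document, so comparing on Type alone is an assumption, as are the function name compatible_links and the top-level section names Input and Output.

```python
def compatible_links(upstream_manifest, downstream_manifest):
    """Pair each upstream output with every downstream input of the same Type."""
    outputs = upstream_manifest.get("Output", {})
    inputs = downstream_manifest.get("Input", {})
    return [
        (out_key, in_key)
        for out_key, out_spec in outputs.items()
        for in_key, in_spec in inputs.items()
        # Entries without a Type (such as empty placeholders) never match.
        if out_spec.get("Type") is not None
        and out_spec.get("Type") == in_spec.get("Type")
    ]
```

Each returned tuple names an output key of the upstream app and an input key of the downstream app that could be wired together under this assumption.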

Let's assume your application gets the ID UUID2 in the FASTGenomics runtime and runs after UUID1 and before UUID3. Then you can access every output of UUID1 but none of UUID3, because UUID3 needs your output in order to run. The following section describes how to access output data from other applications and how to access the dataset.

The best way to test whether your application can be part of a workflow is to run it with sample data, using the input/output mechanism described in the following section.

File input / output

We use files to talk to your app. If you write a calculation app, we expect your output as files, too. Every app can expect to find these folders:

| Folder | Purpose | Mode |
|---|---|---|
| /fastgenomics/config/ | Here you can find your parameters and configurations | Read-only |
| /fastgenomics/data/ | All input files will be located here | Read-only |
| /fastgenomics/output/ | Output directory for your result-files | Read/Write |
| /fastgenomics/summary/ | Store your summary here | Read/Write |

Problem: To access data, you might be tempted to simply load it from /fastgenomics/data/path/to/data.txt and start your calculation, but that is not how FASTGenomics works. Because your application (ID UUID2) is part of a larger workflow whose applications are interchangeable, you cannot know the exact filename or UUID in advance. To address this problem we introduced a file-mapping mechanism: you define unique keys, and at runtime you use these keys to obtain the actual paths of your input and output files.

Using the example of the aforementioned workflow and our fastgenomics-py python module, a typical input/output works as follows:

Let's start with an example: Assume you expect a normalized expression matrix as input (access key normalized_expression_input), which is produced by app UUID1 as a.txt, and you promise to write a data-quality file “data_quality.json” (access key data_quality_output).

First, define your input/output interface in the manifest.json as follows:

manifest.json:

"Input": {
    "normalized_expression_input": {
        "Type": "NormalizedExpressionMatrix",
        "Usage": "Genes Matrix with entrez IDs"
    },
    "other_input": {}
},
"Output": {
    "data_quality_output": {
        "Type": "DataQuality",
        "Usage": "Lists the number of genes for data quality overview.",
        "FileName": "data_quality.json"
    },
    "other_output": {}
}

Then you can access the files in your python code via:

your_code.py:

from fastgenomics import io as fg_io

# resolve the access key to the actual input path
normalized_input_matrix = fg_io.get_input_path('normalized_expression_input')
with normalized_input_matrix.open() as f:
    # do something like f.read()
    pass

Analogous to the input-file-mapping you can write output-files:

your_code.py:

from fastgenomics import io as fg_io

# output keys are resolved with get_output_path, not get_input_path
my_output_file = fg_io.get_output_path('data_quality_output')
with my_output_file.open('w') as f:
    # do something like f.write('foo')
    pass

Warnings:

Parameters

Your app needs to work with a variety of datasets and workflows, so baking parameters into the app is a bad idea. Furthermore, hard-coded parameters are not visible to anyone. Please use parameters instead: they can be changed by users and can be included in the summary automatically. You can set parameters and their default values in your manifest.json:

manifest.json:

"Parameters": {
    "delimiter": {
        "Type": "string",
        "Description": "Delimiter of the input-file",
        "Default": "\t"
    },
    "other_parameter": {}
}

The Type can be one of “Integer”, “String”, “Bool” or “Float”.

If you want to read parameters, we recommend using fastgenomics-py as follows:

your_code.py:

from fastgenomics import io as fg_io

# read all parameters at once
parameters = fg_io.get_parameters()

# read a single parameter by its key
delimiter = fg_io.get_parameter('delimiter')

If you want / need to read the parameters without fastgenomics-py, the process looks like this:

  1. Read the “Parameters” section of manifest.json - this contains the parameters and default values as described above.
  2. Look for /fastgenomics/config/parameters.json. If this file does not exist, use the default values.
  3. If the file does exist - read it and overwrite values from the manifest with the values from this file. The file is a dictionary.

parameters.json details: Each key in the JSON object corresponds to a name you have defined in your application's manifest.json, e.g. delimiter. In contrast to the manifest.json, which describes the app, the parameters.json defines the parameter values that are used in the current execution of the app. For different datasets and workflows these values could be changed by the users later. Initially, the values should be set to the defaults as described in manifest.json.
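The three steps for reading parameters without fastgenomics-py can be sketched as follows. This is an illustration, not the fastgenomics-py implementation: load_parameters is a hypothetical helper, the location of manifest.json inside the container is not stated here and is therefore passed in as an argument, and the sketch assumes that Parameters sits at the top level of the manifest (adjust the lookup if it is nested in your manifest).

```python
import json
from pathlib import Path


def load_parameters(manifest_path, parameters_path="/fastgenomics/config/parameters.json"):
    """Merge the manifest defaults with the optional parameters.json overrides."""
    manifest = json.loads(Path(manifest_path).read_text())
    # Step 1: collect the default value of every declared parameter.
    parameters = {
        name: spec.get("Default")
        for name, spec in manifest.get("Parameters", {}).items()
    }
    # Steps 2 and 3: values from parameters.json win, but only if the file exists.
    parameters_path = Path(parameters_path)
    if parameters_path.exists():
        parameters.update(json.loads(parameters_path.read_text()))
    return parameters
```

Parameters declared without a Default end up as None in this sketch, so check for that before using them.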

Hints:

Summary

Reproducibility is a core goal of FASTGenomics, but it is difficult to achieve without your help. Docker helps to freeze the exact code your app is using, but code without documentation is difficult to use, so an app is expected to have documentation and to provide a so-called “summary” of its results (as Markdown). You need to store it as /fastgenomics/summary/summary.md; otherwise it will be ignored.

While the generic documentation of your application is specified in the manifest.json, we encourage you to describe the scientific meaning of the results achieved by your application in the summary, in the style of the abstract, methods and results sections of a publication. To do so, your application needs to describe all operations applied to the data and all results achieved. These can only be described during and after runtime, because your application does not know the input data beforehand.

For example: “… and identified 14 clusters …”

your_code.py:

from fastgenomics import io as fg_io

summary = "<content>"

# the summary path is provided by fastgenomics-py
summary_file = fg_io.get_summary_path()
with summary_file.open('w') as f_sum:
    f_sum.write(summary)
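Because the summary has to mention results that are only known at runtime, the text is usually assembled from a template. The sketch below uses the standard-library string.Template to keep the example dependency-free; the templates/summary.md.j2 entry in the app structure suggests that the example app uses a Jinja2 template instead. The section layout, the placeholder names and both helper functions are hypothetical, and the sketch does not implement any formal rules for the summary file.

```python
from pathlib import Path
from string import Template

# Hypothetical layout with runtime placeholders ($method, $n_clusters).
SUMMARY_TEMPLATE = Template(
    "## Methods\n\n"
    "Cells were clustered with $method.\n\n"
    "## Results\n\n"
    "The analysis identified $n_clusters clusters.\n"
)


def render_summary(method, n_clusters):
    """Fill the template with values that are only known at runtime."""
    return SUMMARY_TEMPLATE.substitute(method=method, n_clusters=n_clusters)


def write_summary(text, summary_path):
    """Write the rendered Markdown to the given summary path."""
    path = Path(summary_path)
    path.write_text(text)
    return path
```

In a container, summary_path would be /fastgenomics/summary/summary.md, or the path returned by fg_io.get_summary_path().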

The summary is a Markdown file your app has to write every time it runs. The file should follow these rules:

Miscellaneous
Logging

You might wonder how your app can output progress information, debug information, etc. There is an easy solution: simply write to stdout/stderr, for example with print("hello world"). The user of your app can then see this output.

For enhanced debugging and logging we recommend a logging module such as the Python logging module (see hello genomics).

To access the output of your running or terminated application, type docker ps -a to list all (-a) running and terminated apps and identify the container ID of your application. Then type docker logs <container-id> to view the logs.
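A minimal sketch of logging to stdout with the Python logging module is shown below, so that messages show up in docker logs. It is not the logging_config.py shipped with hello genomics; the function name get_app_logger, the default logger name and the message format are assumptions chosen for illustration.

```python
import logging
import sys


def get_app_logger(name="hello_genomics", level=logging.INFO):
    """Return a logger that writes timestamped messages to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling get_app_logger().info("loading input") then produces one timestamped line on stdout.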

Versions

Most of the time, you want to use version numbers to differentiate versions of your application. This version number is not included in your manifest.json, since we use a Docker feature: tags. Every Docker image has a tag, which can be used as the version number. You can see this with many images, where the part after “:” is the tag. E.g.: python:3.6.1 denotes that we use the python image, with Python version 3.6.1.

You can use any tag except :latest, but we recommend an incrementing integer or a Major.Minor.Patch scheme. Please make sure that each push to our registry uses a new tag; do not attempt to overwrite older versions!
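If you build images from a script, a small check can catch an unsuitable tag before pushing. The sketch below accepts only the two recommended schemes, which is stricter than the rule stated above (any tag except latest); the helper name is_recommended_tag is hypothetical.

```python
import re

# Either an incrementing integer ("7") or Major.Minor.Patch ("0.0.1").
_RECOMMENDED_TAG = re.compile(r"\d+(\.\d+\.\d+)?")


def is_recommended_tag(tag):
    """True for integer or Major.Minor.Patch tags; 'latest' never matches."""
    return _RECOMMENDED_TAG.fullmatch(tag) is not None
```

The check does not know which tags already exist in the registry, so it cannot detect an attempt to overwrite an older version.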

We highly encourage you to pin all of your dependencies/requirements to ensure reproducibility.

See also Publishing for more details.

Exit-Code

Please make sure that your app terminates either with Exit code 0 (success) or a nonzero Exit code if you encountered an error. This should be normal behavior for a command line application anyway, but please check it.
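One common way to make the exit code explicit in Python is sketched below. The run function is a stand-in for your app's actual work and the --fail switch exists only to demonstrate the error path; both are hypothetical.

```python
import sys


def run(argv):
    """Hypothetical app body: raise an exception on any failure."""
    if "--fail" in argv:
        raise RuntimeError("simulated failure")


def main(argv=None):
    """Return 0 on success and 1 if the app body raised an exception."""
    try:
        run(sys.argv[1:] if argv is None else argv)
    except Exception as error:
        print(f"error: {error}", file=sys.stderr)
        return 1
    return 0

# In the entry script, pass the result on to the shell: sys.exit(main())
```

An uncaught Python exception already terminates the interpreter with a nonzero exit code, so the main thing to avoid is swallowing errors and exiting normally.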

User

We run the app as a non-root user, so do not rely on a specific user in your app. Best practice: develop your app with a non-root user, e.g. the guest account. See the Docker USER instruction.

Publishing

Checklist:

  1. Write your code, respect the file locations as specified in this readme.
  2. Write a Dockerfile for your App.
  3. Write a manifest.json, which defines the interfaces of your app and provides some additional information. Use English for every description!
  4. Write a docker-compose.yml and provide sample_data.
  5. Ship and respect licences.
  6. Write the input_file_mapping.json.
  7. Build and test your application with docker-compose -f <your_compose_file> build and docker-compose -f <your_compose_file> up.
  8. Check your image size: docker images gives you an overview. Please keep images small, because startup time and memory are limited. Think twice before submitting images larger than 1GB.
  9. Push your image to our registry:
    1. Contact us and login to our registry: docker login apps.fastgenomics.org -u <your username> -p <your password>

    2. We expect this naming convention for your registry and tag: apps.fastgenomics.org/#your name#/#name of your app#:#version#

      For example: apps.fastgenomics.org/teamfastgenomics/ourfirstapp:0.0.1

    3. Build your app using docker build -t <registry/image_name:tag> . (the trailing dot is the build context, here the current directory)

    4. Push to our registry: docker push <registry/image_name:tag>

    5. For details about tagging, see docker tag.

  10. Smile: You did it! You just wrote and published your first FASTGenomics application!

Advanced topics
Input/Output

Using the example of the aforementioned workflow, a typical directory tree your app UUID2 could see under /fastgenomics/ looks like the following tree:

(/fastgenomics/)
.
├── config
│   ├── input_file_mapping.json
│   └── parameters.json (optional, might not exist)
├── data
│   ├── UUID1
│   │   └── output
│   │       └── a.txt
│   ├── UUID2
│   │   └── output
│   │       └── b.txt
│   └── dataset
│       ├── cells.tsv
│       ├── data_quality.json
│       ├── expressions_entrez.tsv
│       ├── genes_considered_all.tsv
│       ├── genes_considered_expressed.tsv
│       ├── genes_considered_unexpressed.tsv
│       ├── genes_entrez.tsv
│       ├── genes_nonentrez.tsv
│       ├── genes.tsv
│       ├── genes_unconsidered_all.tsv
│       ├── genes_unconsidered_expressed.tsv
│       ├── genes_unconsidered_unexpressed.tsv
│       ├── manifest.json
│       └── unconsidered_genes.tsv
├── output
│   └── b.txt
└── summary
    └── summary.md
manifest.json:

"Input": {
    "normalized_expression_input": {
        "Type": "NormalizedExpressionMatrix",
        "Usage": "Genes Matrix with entrez IDs"
    },
    "other_input": {}
},
"Output": {
    "data_quality_output": {
        "Type": "DataQuality",
        "Usage": "Lists the number of genes for data quality overview.",
        "FileName": "data_quality.json"
    },
    "other_output": {}
}

The directory “UUID3” is missing because of the order of applications: your application has to run before UUID3, hence it isn't visible yet.

The actual filename can be looked up in /fastgenomics/config/input_file_mapping.json, which maps each input key from the manifest to a path and looks like the following example:

{
    "normalized_expression_input": "UUID1/a.txt"
}

This file will be created by the FASTGenomics runtime.

Hint: If you would like to test your application in a FASTGenomics runtime-like environment, you have to provide these directories and the input_file_mapping.json on your own. As mechanisms could change we highly recommend the usage of our fastgenomics-py python module as described above.
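For local experiments, the directories and the mapping file can be created with a few lines of Python. The sketch below builds the four folders under an arbitrary root, writes input_file_mapping.json and resolves a key to a path. Both helper names are hypothetical, and the assumption that mapping values are relative to the data folder is inferred from the example above rather than taken from a specification; inside a container, fastgenomics-py remains the recommended way to resolve paths.

```python
import json
from pathlib import Path


def create_sample_tree(root, mapping):
    """Create config/, data/, output/ and summary/ plus the mapping file."""
    root = Path(root)
    for folder in ("config", "data", "output", "summary"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    mapping_file = root / "config" / "input_file_mapping.json"
    mapping_file.write_text(json.dumps(mapping, indent=4))
    return root


def resolve_input(root, key):
    """Look up key in input_file_mapping.json and return the path under data/."""
    root = Path(root)
    mapping_file = root / "config" / "input_file_mapping.json"
    mapping = json.loads(mapping_file.read_text())
    if key not in mapping:
        raise KeyError(f"{key!r} is not listed in input_file_mapping.json")
    # Assumption: mapping values are relative to the data folder.
    return root / "data" / mapping[key]
```

Pointing root at a sample_data directory gives a local layout that can then be mounted into the container via docker-compose.yml.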


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.