Name: hello_genomics_calc_py36
Owner: FASTGenomics
Description: Sample Python Calculation
Created: 2017-10-10 14:24:22.0
Updated: 2017-10-31 13:24:27.0
Pushed: 2018-01-15 15:48:20.0
Size: 121
Language: Python
GitHub Committers
User | Most Recent Commit | # Commits |
---|---|---|
Philipp A. | 2018-01-10 12:28:02.0 | 2 |
Henning Dickten | 2018-03-06 14:33:48.0 | 18 |
Gerhard Schlemm | 2017-12-11 12:17:46.0 | 1 |
Other Committers
User | Email | Most Recent Commit | # Commits |
---|---|---|---|
Christian Sauer | christian.sauer@comma-soft.com | 2017-12-11 12:25:24.0 | 5 |
Hello Genomics
Writing an app is fairly easy: there are some conventions you need to know, but otherwise you are free to use any language, tools and methods you want. This document explains the basic structure of an app using the Python “Hello Genomics” app as an example, describes the workflow for getting an app published, and outlines the core concepts.
There are two flavors of apps in FASTGenomics: Calculations and Visualizations. “Calculations” perform data-intensive tasks, for example clustering, whereas “Visualizations” display the results of calculations. A visualization might take a clustering result and display a diagram for the user.
In the following sections we give you an overview of how to write your own FASTGenomics app that can be used in analyses, and describe the ingredients you need.
Every application runs in the FASTGenomics runtime in its own Docker container (which you can think of as a self-contained, portable workplace). Using Docker containers helps us eliminate “works on my machine” problems and provides full reproducibility and transparency. Moreover, Docker containers let you use any programming language and framework you want to achieve your results, and make it simple for us to integrate your app into analyses if you want us to. You like Python? So do we. You are a Haskell or Julia expert? Just use it! Do you have a special configuration that is extremely complicated or annoying to install? Just do it once and your app will work everywhere.
You never heard of Docker before? Read the article Docker Overview.
These are the few things you really need to know:
- Dockerfile: This is the construction plan of your application. Here you decide what to COPY into, RUN and execute (CMD) within your container.
- docker-compose.yml file: This file describes how to build and start your container and provides input/output directories (volumes) for your container. Have a closer look at our example in order to test your application in a FASTGenomics-runtime-like environment.

In order to build and test your container proceed as follows:
- Build the container: docker-compose -f <docker-compose.filename.yml> build
- Set the paths to your sample data in docker-compose.yml. We recommend relative paths.
- Start the container: docker-compose -f <docker-compose.filename.yml> up
You already have a working Python script? Just clone hello-genomics, replace main.py, rename the directory, and modify the paths in the Dockerfile.
One more thing: Once you have started your application (container), you can list all current instances via docker ps -a.
To inspect the output of an application, just type docker logs <container-id>.
Your application should be structured as follows:
docker-compose.yml (best practice)
Dockerfile (mandatory)
hello_genomics (mandatory: source code)
├── __init__.py
├── logging_config.py
└── main.py
manifest.json (mandatory)
LICENSE (mandatory)
README.md (mandatory)
requirements.txt (best practice)
sample_data (mandatory)
├── config
│   └── input_file_mapping.json
├── data
│   ├── dataset
│   │   ├── ...
│   │   └── considered_genes.tsv
│   └── other_app_uuid
│       └── output
├── output
└── summary
templates (optional)
└── summary.md.j2
test (best practice)
FASTGenomics assumes that:
- manifest.json is present in the root directory
- LICENSE text is present in the root directory
- Dockerfile is present in the root directory and defines a default command via CMD or ENTRYPOINT
- sample_data is present and available for testing (together with a docker-compose.yml)

Each app has to provide a manifest.json file with metadata entries. See the attached manifest.json for more information.
To validate your directory structure and manifest.json, just use check_my_app in the fastgenomics-py package.
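As a rough illustration of what such a structure check involves, the following sketch tests only for the entries this README marks as mandatory. It is not the check_my_app implementation; the function name find_missing_entries and the list MANDATORY_ENTRIES are hypothetical, and the source-code directory is left out because its name differs per app.

```python
from pathlib import Path
from typing import List

# Entries this README marks as mandatory. This list is an assumption copied
# from the directory overview, not the official check_my_app logic.
MANDATORY_ENTRIES = ["Dockerfile", "manifest.json", "LICENSE", "README.md", "sample_data"]


def find_missing_entries(app_root: Path) -> List[str]:
    """Return the mandatory files/directories that are absent from app_root."""
    return [name for name in MANDATORY_ENTRIES if not (app_root / name).exists()]


if __name__ == "__main__":
    missing = find_missing_entries(Path("."))
    if missing:
        print("Missing mandatory entries:", ", ".join(missing))
    else:
        print("All mandatory entries are present.")
```

Run from the root directory of your app, this gives a quick first impression before you use the real validation from fastgenomics-py.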
┌────────────┐        ┌────────────┐        ┌────────────┐
│            │        │            │        │            │
│  app N-1   │ ─────▶ │  your app  │ ─────▶ │  app N+1   │
│  (UUID1)   │ a.txt  │  (UUID2)   │ b.txt  │  (UUID3)   │
│            │        │            │        │            │
└────────────┘        └────────────┘        └────────────┘
Your app is part of something bigger and a piece of the puzzle: One of our goals is to enable you to create powerful analyses composed of small interchangeable applications like yours. To achieve this, every app should be as universal as possible. Also, every app has to declare its inputs and outputs so that we know which apps can be combined into a “workflow”.
Example: If you write a classification app, we would like to know the Type and intent (Usage field in the manifest.json) of your input and output.
As a consequence, we can avoid feeding your output into another app that expects unclustered data as input.
In future releases we would like to unify these types and intents and allow for an easy-to-use, “Lego”-like interface for your app.
Let's assume your application gets the ID UUID2 in the FASTGenomics runtime and runs after UUID1 and before UUID3.
Then you can access every output of UUID1, but not of UUID3, because UUID3 needs your output to run.
In the following section we describe how to access output data from other applications and the dataset.
The best way to test whether your application can be part of a workflow is to run it with sample data, using the input/output mechanism described in the following section.
We use files to talk to your app. If you write a calculation app, we expect your output as files, too. Every app can expect to find these folders:
| Folder | Purpose | Mode |
|---|---|---|
| /fastgenomics/config/ | Here you can find your parameters and configurations | Read-only |
| /fastgenomics/data/ | All input files will be located here | Read-only |
| /fastgenomics/output/ | Output directory for your result-files | Read/Write |
| /fastgenomics/summary/ | Store your summary here | Read/Write |
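When testing locally, it can help to verify that these four folders exist and that the two output folders are writable. The following is a minimal sketch under that assumption; check_folders and EXPECTED_FOLDERS are hypothetical names and not part of fastgenomics-py.

```python
import os
from pathlib import Path
from typing import Dict

# Folder name -> whether the app needs write access, as listed in the table.
EXPECTED_FOLDERS = {"config": False, "data": False, "output": True, "summary": True}


def check_folders(root: Path = Path("/fastgenomics")) -> Dict[str, str]:
    """Report problems per folder; an empty dict means the layout looks usable."""
    problems = {}
    for name, needs_write in EXPECTED_FOLDERS.items():
        folder = root / name
        if not folder.is_dir():
            problems[name] = "missing"
        elif needs_write and not os.access(str(folder), os.W_OK):
            problems[name] = "not writable"
    return problems
```

Passing the path of your sample_data directory as root lets you run the same check outside a container.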
Problem:
To get access to data, one could simply load it from /fastgenomics/data/path/to/data.txt and start the calculation, but that's not how FASTGenomics works:
As your application (ID UUID2) is part of a larger workflow whose applications are interchangeable, you cannot know in advance the exact filename or UUID used at runtime.
To address this problem we introduced a file mapping mechanism, in which you define unique keys under which you would like to get the actual path of the input or output file.
Using the example of the aforementioned workflow and our fastgenomics-py Python module, typical input/output works as follows:
Let's start with an example:
Assume you expect a normalized expression matrix as input (access-key normalized_expression_input), which is produced by app UUID1 as a.txt, and you promise to write a data-quality file “data_quality.json” (access-key data_quality_output).
First, define your input/output interface in the manifest.json as follows:
manifest.json:
"Input": {
    "normalized_expression_input": {
        "Type": "NormalizedExpressionMatrix",
        "Usage": "Genes Matrix with entrez IDs"
    },
    "other_input": {}
},
"Output": {
    "data_quality_output": {
        "Type": "DataQuality",
        "Usage": "Lists the number of genes for data quality overview.",
        "FileName": "data_quality.json"
    },
    "other_output": {}
}
Then you can access the files in your Python code via:
your_code.py:
from fastgenomics import io as fg_io

# resolve the actual path of the input file from its access-key
normalized_input_matrix = fg_io.get_input_path('normalized_expression_input')
with normalized_input_matrix.open() as f:
    # do something like f.read()
    pass
Analogous to the input-file-mapping you can write output-files:
your_code.py:
from fastgenomics import io as fg_io

# resolve the path of the output file from its access-key
my_output_file = fg_io.get_output_path('data_quality_output')
with my_output_file.open('w') as f:
    # do something like f.write('foo')
    pass
Warnings:
Your app needs to work with a variety of datasets and workflows, so baking parameters into the app is a bad idea. Furthermore, such hard-coded parameters are not visible to anyone. So please use configuration options, which can be changed per run and included in the summary automatically.
You can set parameters and their default values in your manifest.json:
manifest.json:
"Parameters": {
    "delimiter": {
        "Type": "string",
        "Description": "Delimiter of the input-file",
        "Default": "\t"
    },
    "other_parameter": {}
}
The Type can be one of “Integer”, “String”, “Bool” or “Float”.
If you want to read parameters, we recommend using fastgenomics-py as follows:
your_code.py:
from fastgenomics import io as fg_io

# all parameters at once
parameters = fg_io.get_parameters()
# a single parameter by the name defined in manifest.json
delimiter = fg_io.get_parameter('delimiter')
If you want / need to read the parameters without fastgenomics-py, the process looks like this:
parameters.json Details
Each key in the JSON object corresponds to a name you have defined in your application's manifest.json, e.g. delimiter.
In contrast to the manifest.json, which describes the app, the parameters.json defines the parameter values that are used in the current execution of the app.
Users can change these values later for different datasets and workflows. Initially, the values should be set to the defaults described in manifest.json.
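The relationship between the two files can be sketched without fastgenomics-py roughly as follows. This sketch assumes that parameters.json is a flat JSON object mapping parameter names to values, and that the manifest keeps its defaults under a top-level "Parameters" key as in the fragment shown earlier. The function name load_parameters is hypothetical, and both file locations are supplied by the caller.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_parameters(manifest_file: Path, parameters_file: Path) -> Dict[str, Any]:
    """Start from the manifest defaults and override them with runtime values."""
    manifest = json.loads(manifest_file.read_text())
    # Assumption: defaults live under manifest["Parameters"][name]["Default"].
    values = {
        name: spec.get("Default")
        for name, spec in manifest.get("Parameters", {}).items()
    }
    # parameters.json is optional and might not exist.
    if parameters_file.exists():
        values.update(json.loads(parameters_file.read_text()))
    return values
```

Falling back to the manifest defaults keeps the app runnable when no parameters.json is provided, which the directory tree in this document describes as a possible case.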
Hints:
manifest.json
Reproducibility is a core goal of FASTGenomics, but it is difficult to achieve this without your help.
Docker helps to freeze the exact code your app is using, but code without documentation is difficult to use, so an app is expected to be documented and to provide a so-called “summary” of its results (as Markdown).
You need to store it as /fastgenomics/summary/summary.md, otherwise it will be ignored.
While generic documentation of your application is specified in the manifest.json, we encourage you to describe the scientific meaning of the results achieved by your application in the summary, in the style of the “abstract, methods and results” sections of a publication. To do so, your application needs to describe all operations applied to the data and all results achieved. These can only be described during and after runtime, because your application doesn't know the input data beforehand.
For example: “… and identified 14 clusters …”
your_code.py:
from fastgenomics import io as fg_io

summary = "<content>"
# path of the summary file (summary.md) provided by the runtime
summary_file = fg_io.get_summary_path()
with summary_file.open('w') as f_sum:
    f_sum.write(summary)
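Because the summary has to contain values that are only known at runtime, it is usually generated from a template. The sketch below uses string.Template from the standard library as a stand-in for a template engine. The template text, the function render_summary and its arguments are made up for illustration.

```python
from string import Template

# Hypothetical summary template; $delimiter and $n_clusters are filled at runtime.
SUMMARY_TEMPLATE = Template(
    "# Summary\n\n"
    "## Methods\n\n"
    "The input file was parsed with the delimiter $delimiter.\n\n"
    "## Results\n\n"
    "The analysis identified $n_clusters clusters.\n"
)


def render_summary(delimiter: str, n_clusters: int) -> str:
    """Fill the template with a parameter value and a computed result."""
    # repr() makes whitespace delimiters such as a tab visible in the text.
    return SUMMARY_TEMPLATE.substitute(delimiter=repr(delimiter), n_clusters=n_clusters)
```

The returned string would take the place of the "<content>" placeholder in the snippet above.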
The summary is a Markdown file your app has to write every time it runs. The file should follow these rules:
manifest.json
, don't use constants besides parametersmanifest.json
manifest.json
You might wonder how your app can output progress, debug information, etc. There is an easy solution for this: simply write to stdout/stderr, for example with print("hello world"). The user of your app can then see this output.
For enhanced debugging and logging we recommend logging modules like the Python logging module (see hello genomics).
To gain access to the output of your running/terminated application, type docker ps -a to list all (-a) running and terminated apps and identify the container-id of your application.
Then type docker logs <container-id> to access the logs.
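A minimal way to send log messages to stdout with the Python logging module could look like the sketch below. It is not the logging_config.py of hello genomics; the logger name "my_app" and the message format are arbitrary choices.

```python
import logging
import sys


def get_logger(name: str = "my_app", level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped messages to stdout."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


if __name__ == "__main__":
    log = get_logger()
    log.info("hello world")
```

Everything logged this way shows up in docker logs <container-id>, just like the output of print.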
Most of the time, you want to use version numbers to differentiate versions of your application.
This version number is not included in your manifest.json, since we use a Docker feature: tags.
Every Docker image has a tag, which can be used as the version number.
You can see this with many images, where the part after “:” is the tag.
E.g.: python:3.6.1 denotes that we use the python image with Python version 3.6.1.
You can use any tag except :latest, but we recommend an incrementing integer or a Major.Minor.Patch scheme.
Please make sure that each push to our registry uses a new tag; do not attempt to overwrite older versions!
We highly encourage you to pin all of your dependencies/requirements to ensure reproducibility.
See also Publishing for more details.
Please make sure that your app terminates either with Exit code 0 (success) or a nonzero Exit code if you encountered an error. This should be normal behavior for a command line application anyway, but please check it.
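In Python, one common pattern for this is to let a main function return the exit code and hand it to sys.exit. The following is a sketch with a placeholder calculation; the names run and main are illustrative.

```python
import sys
from typing import Callable


def run() -> None:
    """Placeholder for the actual calculation (hypothetical)."""
    print("calculation finished")


def main(calculation: Callable[[], None] = run) -> int:
    """Return 0 on success and 1 if the calculation raises an error."""
    try:
        calculation()
    except Exception as error:
        # Report the problem on stderr so that it appears in the logs.
        print(f"error: {error}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

An uncaught exception already makes the Python interpreter exit with a nonzero code, so the explicit handling mainly serves to produce a readable error message.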
We use a non-root user when running the app, so do not try to use a specific user in your app. Best practice: Develop your app with a non-root user, e.g. the guest account. See the Docker USER instruction.
Checklist:
- Your app builds and runs via docker-compose -f <your_compose_file> build and docker-compose -f <your_compose_file> up
- Check the size of your image: docker images gives you an overview. Please go easy on image sizes, because startup time and memory are limited. Think twice before submitting images larger than 1GB.

Contact us and log in to our registry:
docker login apps.fastgenomics.org -u <your username> -p <your password>
We expect this naming convention for your registry and tag: apps.fastgenomics.org/#your name#/#name of your app#:#version#
For example: apps.fastgenomics.org/teamfastgenomics/ourfirstapp:0.0.1
Build your app using docker build -t <registry/image_name:tag> . (the trailing dot is the build context)
Push to our registry: docker push <registry/image_name:tag>
For details about tagging, see docker tag.
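The naming convention and the tagging rules can be captured in a small helper. In the sketch below, image_reference and is_recommended_version are hypothetical names; only the registry host, the ban on :latest and the two recommended version schemes come from this document.

```python
import re

REGISTRY = "apps.fastgenomics.org"
# Recommended schemes: an incrementing integer ("7") or Major.Minor.Patch ("0.0.1").
RECOMMENDED_VERSION = re.compile(r"\d+(\.\d+\.\d+)?")


def is_recommended_version(version: str) -> bool:
    """True if the version follows one of the two recommended schemes."""
    return RECOMMENDED_VERSION.fullmatch(version) is not None


def image_reference(owner: str, app_name: str, version: str) -> str:
    """Build <registry>/<owner>/<app name>:<version>, rejecting the tag 'latest'."""
    if version == "latest":
        raise ValueError("the tag 'latest' must not be used")
    return f"{REGISTRY}/{owner}/{app_name}:{version}"
```

For instance, image_reference("teamfastgenomics", "ourfirstapp", "0.0.1") yields the example reference given above.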
Using the example of the aforementioned workflow, a typical directory tree that your app UUID2 could see under /fastgenomics/ looks like this:
(/fastgenomics/)
.
├── config
│   ├── input_file_mapping.json
│   └── parameters.json (optional, might not exist)
├── data
│   ├── UUID1
│   │   └── output
│   │       └── a.txt
│   ├── UUID2
│   │   └── output
│   │       └── b.txt
│   └── dataset
│       ├── cells.tsv
│       ├── data_quality.json
│       ├── expressions_entrez.tsv
│       ├── genes_considered_all.tsv
│       ├── genes_considered_expressed.tsv
│       ├── genes_considered_unexpressed.tsv
│       ├── genes_entrez.tsv
│       ├── genes_nonentrez.tsv
│       ├── genes.tsv
│       ├── genes_unconsidered_all.tsv
│       ├── genes_unconsidered_expressed.tsv
│       ├── genes_unconsidered_unexpressed.tsv
│       ├── manifest.json
│       └── unconsidered_genes.tsv
├── output
│   └── b.txt
└── summary
    └── summary.md
manifest.json:
"Input": {
    "normalized_expression_input": {
        "Type": "NormalizedExpressionMatrix",
        "Usage": "Genes Matrix with entrez IDs"
    },
    "other_input": {}
},
"Output": {
    "data_quality_output": {
        "Type": "DataQuality",
        "Usage": "Lists the number of genes for data quality overview.",
        "FileName": "data_quality.json"
    },
    "other_output": {}
}
The directory “UUID3” is missing because of the order of applications: your application has to run before UUID3, hence it isn't visible yet.
The actual filename can be looked up in /fastgenomics/config/input_file_mapping.json, which looks like the following example:
{
    "normalized_expression": "UUID1/a.txt"
}
This file will be created by the FASTGenomics runtime.
Hint: If you would like to test your application in a FASTGenomics runtime-like environment, you have to provide these directories and the input_file_mapping.json on your own. As mechanisms could change we highly recommend the usage of our fastgenomics-py python module as described above.
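For local experiments only, the lookup can be approximated by hand as in the following sketch. It assumes that the values in input_file_mapping.json are paths relative to the data folder, which this document does not state explicitly. The function resolve_input_path is hypothetical and not a replacement for fastgenomics-py.

```python
import json
from pathlib import Path


def resolve_input_path(key: str, root: Path = Path("/fastgenomics")) -> Path:
    """Look up key in input_file_mapping.json and return the mapped file path."""
    mapping_file = root / "config" / "input_file_mapping.json"
    mapping = json.loads(mapping_file.read_text())
    if key not in mapping:
        raise KeyError(f"no input file mapped for key {key!r}")
    # Assumption: mapped values are relative to <root>/data/.
    return root / "data" / mapping[key]
```

Pointing root at your sample_data directory lets you exercise the same lookup outside a container.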