icgc-dcc/ega_script

Name: ega_script

Owner: ICGC DCC

Description: null

Created: 2017-07-25 23:08:24.0

Updated: 2017-07-27 19:11:42.0

Pushed: 2017-08-15 14:58:38.0

Homepage: null

Size: 62

Language: Python

README

ega_script

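The tool is used to generate the EGA-file-to-Collab transfer jobs and to report the `to_stage` and `to_remove` files, based on the EGA auditing reports.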

Getting Started

To carry out the above tasks, the tool gathers information from two kinds of git repository: the EGA auditing repository and the job repositories.

Prerequisites

Before you can run the tool, you need to configure it. The configuration file is located at `ega_script/conf/conf.yaml`. You may need to change the following two base paths, for `ega_audit` and `ega_job` respectively.

audit_base_path: "../ega-file-transfer"
job_base_path: ".."

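The above default configuration assumes that the EGA auditing repository `ega-file-transfer` and the job repositories sit in the parent directory of the directory the tool is run from.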

The EGA auditing git repository is version controlled, so before running the tool to generate the jobs or report the `to_stage` or `to_remove` files, you also need to set the version of the EGA auditing reports in `conf/conf.yaml`, e.g.,

file_version: "v20170630"

Installing

Get the source script of the tool:

git clone git@github.com:lindaxiang/ega_script.git

Then you can run `./main.py -h` to get the usage of the tool:

usage: main.py [-h] [-c CONF] -t TASK [-p [PROJECT [PROJECT ...]]]
               [-s [SEQ_STRATEGY [SEQ_STRATEGY ...]]]

EGA-file-to-colllab job generator and auditor

optional arguments:
  -h, --help            show this help message and exit
  -c CONF, --setting CONF
                        Specify ega setting file
  -t TASK, --task TASK  Specify the task
  -p [PROJECT [PROJECT ...]], --project [PROJECT [PROJECT ...]]
                        Specify the project
  -s [SEQ_STRATEGY [SEQ_STRATEGY ...]], --seq_strategy [SEQ_STRATEGY [SEQ_STRATEGY ...]]
                        Specify the sequencing strategy
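
The usage text above can be approximated with Python's standard `argparse` module. The following is a minimal sketch reconstructed from the help output only, not the tool's actual source: the function name `build_parser`, the `dest` names, and the default value for `--setting` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options listed in the usage text."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="EGA-file-to-colllab job generator and auditor",
    )
    # The default path is an assumption; the README only names conf/conf.yaml.
    parser.add_argument("-c", "--setting", dest="conf", metavar="CONF",
                        default="conf/conf.yaml",
                        help="Specify ega setting file")
    # -t is the only mandatory option in the usage line.
    parser.add_argument("-t", "--task", dest="task", required=True,
                        help="Specify the task")
    # nargs="*" accepts zero or more values, shown as [PROJECT [PROJECT ...]].
    parser.add_argument("-p", "--project", dest="project", nargs="*",
                        help="Specify the project")
    parser.add_argument("-s", "--seq_strategy", dest="seq_strategy", nargs="*",
                        help="Specify the sequencing strategy")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-t", "job", "-p", "CLLE-ES", "-s", "RNA-Seq"])
    print(args.task, args.project, args.seq_strategy)
```

With this sketch, omitting `-p` or `-s` leaves the corresponding attribute as `None`, which a caller can treat as "no restriction".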

Running the tool to generate the jobs

For example, to generate jobs for `RNA-Seq` data of project `CLLE-ES`, do this:

cd ega_script
./main.py -t job -p CLLE-ES -s RNA-Seq

If no `project` is specified, the tool will generate the eligible jobs for all the projects which have auditing reports available.
If no `seq_strategy` is specified, the tool will generate the eligible jobs for every seq_strategy included in the related auditing reports.
The generated jobs are located in the `job_state.backlog` folder of one of the job repositories, as defined in `conf/conf.yaml`. You can change the `job_folder` if needed:

job_folder: "ega-file-transfer-to-collab-jtracker/ega-file-transfer-to-collab.0.6.jtracker/job_state.backlog"
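
The selection rule described above (no `project` means all projects, no `seq_strategy` means all strategies) can be sketched as a small filter. This is only an illustration under stated assumptions: the function name `select_records`, the record field names, and the sample records (including the `WGS` strategy and the `file-N` ids) are hypothetical and are not taken from the tool or the auditing reports.

```python
def select_records(records, projects=None, seq_strategies=None):
    """Return the records matching the requested projects and strategies.

    A missing or empty filter means "no restriction".
    """
    wanted_projects = set(projects or [])
    wanted_strategies = set(seq_strategies or [])
    selected = []
    for record in records:
        if wanted_projects and record["project"] not in wanted_projects:
            continue
        if wanted_strategies and record["seq_strategy"] not in wanted_strategies:
            continue
        selected.append(record)
    return selected


# Hypothetical audit records; field names and values are assumptions.
records = [
    {"project": "CLLE-ES", "seq_strategy": "RNA-Seq", "file_id": "file-1"},
    {"project": "CLLE-ES", "seq_strategy": "WGS", "file_id": "file-2"},
    {"project": "LICA-FR", "seq_strategy": "RNA-Seq", "file_id": "file-3"},
]

if __name__ == "__main__":
    for record in select_records(records, ["CLLE-ES"], ["RNA-Seq"]):
        print(record["file_id"])
```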

Running the tool to generate the `to_stage` files

In order to get the list of files which are to be staged to the Aspera server by EGA, do this:

cd ega_script
./main.py -t stage

You can specify the `project` and `seq_strategy` to restrict the list to files which belong to the given projects and sequencing strategies.
The tool will generate `to_stage_*.tsv` files under each project. For example:

ega_operation/
├── BRCA-KR
│   └── to_stage_run.tsv
├── CLLE-ES
│   └── to_stage_run.tsv
├── LICA-FR
│   └── to_stage_run.tsv
├── MALY-DE
│   └── to_stage_analysis.tsv
├── OV-AU
│   └── to_stage_analysis.tsv
├── PACA-AU
│   ├── to_stage_analysis.tsv
│   └── to_stage_run.tsv
├── PAEN-AU
│   └── to_stage_analysis.tsv
└── to_remove.tsv
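
To inspect a layout like the one above, a short helper can collect the `to_stage_*.tsv` files per project. This is a sketch for reading the generated output, not part of the tool; the function name `collect_to_stage_files` is hypothetical.

```python
from pathlib import Path


def collect_to_stage_files(operation_dir):
    """Map each project directory name to its sorted to_stage_*.tsv names."""
    result = {}
    for project_dir in sorted(Path(operation_dir).iterdir()):
        if not project_dir.is_dir():
            continue  # skip top-level files such as to_remove.tsv
        names = sorted(p.name for p in project_dir.glob("to_stage_*.tsv"))
        if names:
            result[project_dir.name] = names
    return result
```

Called on the `ega_operation/` directory shown above, this would map, for instance, `PACA-AU` to both of its `to_stage` files, while projects without any `to_stage_*.tsv` file are left out.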

Running the tool to generate the `to_remove` files

In order to list all files which can be removed from the Aspera server by EGA, do this:

cd ega_script
./main.py -t remove

The tool will generate the `to_remove.tsv` file, located at `ega-file-transfer/ega_operation/to_remove.tsv`.

Log information

When generating the jobs or reporting the `to_stage` or `to_remove` files, the tool performs many QC checks based on the auditing reports. The QC results are logged into the `*.log` files located at:

ega_script/log/
├── error.log
├── info.log
└── warn.log
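
One way to obtain three level-specific files like these is Python's standard `logging` module with one `FileHandler` per level. The sketch below is an assumption about how such a setup could look, not the tool's actual logging code: the names `setup_logger` and `ExactLevelFilter`, the one-level-per-file split, and the format string (chosen to resemble the sample messages in this section) are all hypothetical.

```python
import logging
import os

# Assumed format: timestamp - logger name - level - message.
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


class ExactLevelFilter(logging.Filter):
    """Let through only records of one specific level."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, record):
        return record.levelno == self.level


def setup_logger(name, log_dir):
    """Attach one file handler per level (info, warn, error) to a logger."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter(LOG_FORMAT)
    for filename, level in (("info.log", logging.INFO),
                            ("warn.log", logging.WARNING),
                            ("error.log", logging.ERROR)):
        handler = logging.FileHandler(os.path.join(log_dir, filename))
        handler.setFormatter(formatter)
        handler.addFilter(ExactLevelFilter(level))
        logger.addHandler(handler)
    return logger
```

With this setup, a call such as `setup_logger("audit.stage", "log").warning(...)` writes only to `warn.log`.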

Here are some sample log messages:

2017-07-25 15:12:45,689 - audit.stage - WARNING - LICA-FR::EGAF00000483937 has the same file_md5sum and encrypted_file_md5sum: set(['772febc5f8fea25a9b09e43dd51e43bd'])
2017-07-25 15:12:45,690 - audit.stage - WARNING - LICA-FR::EGAF00000483938 has the same file_md5sum and encrypted_file_md5sum: set(['170588f8a583c2d4fee882fdfcb6133b'])
2017-07-25 15:12:45,690 - audit.stage - WARNING - LICA-FR::EGAF00000483899 has the same file_md5sum and encrypted_file_md5sum: set(['83aed772452945dc994bcfad7edebc3a'])
2017-07-25 15:12:49,248 - audit.stage - WARNING - MALY-DE::EGAF00001592148 has the id inconsistent: ega_analysis_id in audit report version v20170630
2017-07-25 15:12:49,248 - audit.stage - WARNING - MALY-DE::EGAF00001592148 has the id inconsistent: file_name in audit report version v20170630
2017-07-25 15:12:49,248 - audit.stage - WARNING - MALY-DE::EGAF00001592148 has the id inconsistent: encrypted_file_md5sum in audit report version v20170630
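
Messages in this shape are regular enough to summarize with a small script. The following sketch parses lines of the form `timestamp - logger - LEVEL - PROJECT::FILE_ID message` and counts them per project. The pattern is inferred from the sample messages only, so other messages written by the tool may not match it; the names `LOG_LINE`, `parse_log_line`, and `count_by_project` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample messages; other lines may not match.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r" - (?P<logger>\S+) - (?P<level>[A-Z]+)"
    r" - (?P<project>[^:\s]+)::(?P<file_id>\S+) (?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_by_project(lines):
    """Count matching log lines per project code."""
    counts = Counter()
    for line in lines:
        parsed = parse_log_line(line)
        if parsed:
            counts[parsed["project"]] += 1
    return counts
```

Feeding the lines of `warn.log` to `count_by_project` gives a quick view of which projects trigger the most QC warnings.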

Authors

**Linda Xiang** - *Initial work*

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.