AAFC-BICoE/nrc-ngs-downloader

Name: nrc-ngs-downloader

Owner: Biological Informatics CoE @ Agriculture and Agri-Food Canada

Description: nrc-ngs-downloader is a Python program that explores the NRC-LIMS website, downloads all the sequence files, and keeps the metadata of all the sequences in a SQLite database.

Created: 2017-10-20 17:28:41.0

Updated: 2017-10-20 19:31:52.0

Pushed: 2017-10-31 12:56:40.0

Homepage: null

Size: 72

Language: Python

README

NRC-LIMS-dataDownloader

Description

NRC-LIMS-dataDownloader is a Python program that explores the NRC-LIMS website, downloads all the sequence files, and keeps the metadata of all the sequences in a SQLite database.

The list of the tasks performed by the software:

  1. Scrapes the NRC-LIMS website to get a list of all the completed sequence runs and all information related to the sequence runs and sequence files.
  2. Identifies new runs that have not been downloaded previously, as well as re-processed/modified runs, by checking each sequence run against the database.
  3. Downloads each new/re-processed run's data and then unzips the file to obtain the demultiplexed fastq files.
  4. Renames each fastq file to the submitted sample name from the sequencing run information page.
  5. Generates a SHA256 checksum for each fastq file and gzips the file.
  6. Inserts information about newly downloaded runs and files into the database.
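
Step 5 above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `sha256_of_file` and `checksum_and_gzip` are hypothetical, and the sketch assumes the checksum is computed on the uncompressed fastq file before it is gzipped and that the uncompressed original is then removed.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large fastq files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_and_gzip(fastq_path):
    """Hash an uncompressed fastq file, gzip it, and delete the original.

    Assumption: the checksum covers the uncompressed content.
    Returns (checksum, path_to_gz_file).
    """
    fastq_path = Path(fastq_path)
    checksum = sha256_of_file(fastq_path)
    gz_path = fastq_path.with_name(fastq_path.name + ".gz")
    with open(fastq_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    fastq_path.unlink()  # keep only the compressed copy
    return checksum, gz_path
```

Hashing in chunks keeps memory use constant regardless of file size, which matters for sequencing output.
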

Requirements

Deployment Procedures

Set up the HCRON service

SQLite database

Three tables are maintained in this database. They are updated each time the program is run.

  1. data_packages: keeps all the information about each sequence run (run-name,….)
  2. data_files: keeps all the information about each sequence file, including information scraped from the webpage, the checksum (SHA256), the original name and new name of the file, etc.
  3. program_action: keeps a record of each run of the application, such as failures, successes, URLs scraped/attempted, timestamps, and sequence runs downloaded.
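
To make the three tables concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Only the table names come from the list above; every column name, the helper functions `open_database`, `needs_download`, and `record_run`, and the idea of detecting a re-processed run by comparing a stored `last_modified` value are assumptions made for illustration and may not match the project's real schema.

```python
import sqlite3

# Hypothetical schema: table names are from the README, columns are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS data_packages (
    id INTEGER PRIMARY KEY,
    run_name TEXT NOT NULL UNIQUE,
    last_modified TEXT
);
CREATE TABLE IF NOT EXISTS data_files (
    id INTEGER PRIMARY KEY,
    package_id INTEGER NOT NULL REFERENCES data_packages(id),
    original_name TEXT NOT NULL,
    new_name TEXT NOT NULL,
    sha256 TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS program_action (
    id INTEGER PRIMARY KEY,
    started_at TEXT NOT NULL,
    status TEXT NOT NULL,
    detail TEXT
);
"""


def open_database(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def needs_download(conn, run_name, last_modified):
    """Return True if the run is unknown or its stored timestamp differs."""
    row = conn.execute(
        "SELECT last_modified FROM data_packages WHERE run_name = ?",
        (run_name,),
    ).fetchone()
    return row is None or row[0] != last_modified


def record_run(conn, run_name, last_modified):
    """Insert a new run, or update the timestamp of an existing one."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE data_packages SET last_modified = ? WHERE run_name = ?",
            (last_modified, run_name),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO data_packages (run_name, last_modified) VALUES (?, ?)",
                (run_name, last_modified),
            )
```

With this layout, checking a scraped run against the database is a single lookup by run name, and a changed timestamp marks the run for re-download.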

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.