monarch-initiative/dipper

Name: dipper

Owner: Monarch Initiative

Description: Data Ingestion Pipeline for SciGraph

Created: 2014-10-25 16:49:58.0

Updated: 2017-11-28 19:45:53.0

Pushed: 2018-01-04 22:53:03.0

Homepage: null

Size: 80350

Language: Python

README

DIPPER

Dipper is a pure Python package that generates RDF triples from common scientific resources. It includes subpackages and modules to create graphical models of this data.

Requirements

Note that Dipper imports source modules dynamically at runtime. As a result, it is possible to install a core set of requirements and add source-specific dependencies as needed. Presently this is only implemented with pip requirements files. For example, to install the dependencies for MGI:

    pip3 install -r requirements.txt
    pip3 install -r requirements/mgi.txt

To install dependencies for all sources:

    pip3 install -r requirements.txt
    pip3 install -r requirements/all-sources.txt

If you encounter errors installing these packages in a Homebrew environment, the cause may be a known issue with upgrading to pip3. In this case, first force-reinstall pip2 (pip2 install --upgrade --force-reinstall pip) and then install the package using pip3 (e.g., pip3 install psycopg2).
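Because source modules are imported dynamically, a missing source-specific dependency only surfaces when that source is loaded. The following is a minimal sketch of the general technique using the standard library's importlib; it is not Dipper's own loader, and the helper name load_source_class is hypothetical. It is demonstrated with a standard-library module so that it runs without Dipper installed.

```python
import importlib


def load_source_class(module_name, class_name):
    """Import a module by its dotted name at runtime and return a class from it.

    Hypothetical helper for illustration; not part of Dipper.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as err:
        # A failed import here is where a missing per-source requirement
        # would show up, so the error message points at that possibility.
        raise ImportError(
            f"Could not import {module_name!r}; "
            "a source-specific dependency may be missing"
        ) from err
    return getattr(module, class_name)


# Stand-in for a per-source module: the standard-library json module.
decoder_class = load_source_class("json", "JSONDecoder")
decoder = decoder_class()
print(decoder.decode('{"source": "mgi"}'))
```

With this pattern, only the modules for the sources actually requested are imported, which is why the core requirements and the per-source requirements files can be installed separately.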

Running Dipper:
Installing Dipper as an external python package:

You can also write your own Dipper packages outside of this project, using the framework we've set up here. Simply import Dipper as a Python package, write your own wrapper, and add your own source parsers.
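As a rough illustration of what a source parser does, the sketch below reads tab-separated gene-to-phenotype rows and emits N-Triples. It deliberately does not import Dipper, so it runs standalone. The class ExampleSource, its methods, and the example.org IRIs are all hypothetical and do not reflect Dipper's actual classes or modeling patterns; a real parser would build on the framework provided by the package.

```python
import csv
import io

# Placeholder IRIs, assumed for illustration only.
BASE = "http://example.org/"
HAS_PHENOTYPE = BASE + "has_phenotype"


class ExampleSource:
    """Hypothetical parser that turns gene/phenotype rows into triples."""

    def __init__(self):
        self.triples = []

    def parse(self, handle, limit=None):
        """Read two-column tab-separated rows, stopping after `limit` rows."""
        reader = csv.reader(handle, delimiter="\t")
        for count, (gene_id, phenotype_id) in enumerate(reader):
            if limit is not None and count >= limit:
                break
            self.triples.append(
                (BASE + gene_id, HAS_PHENOTYPE, BASE + phenotype_id)
            )

    def to_ntriples(self):
        """Serialize the collected triples, one N-Triples statement per line."""
        return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in self.triples)


# Usage: parse a small in-memory file and print the resulting triples.
data = io.StringIO("gene1\tphenotype1\ngene2\tphenotype2\n")
source = ExampleSource()
source.parse(data)
print(source.to_ntriples(), end="")
```

Going from each row directly to a triple, with no intermediate relational schema, is the general shape of the approach described in this README.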

Sources:
Identifiers

Our identifier documentation, as referenced in our recent paper on identifiers (doi:10.1371/journal.pbio.2001414, https://doi.org/10.1371/journal.pbio.2001414), has been moved to https://github.com/monarch-initiative/monarch-app/blob/master/README.md#identifiers

About this project

The DIPper data pipeline was born out of the need for a uniform representation of human and model organism genotype-to-phenotype data, and an easy Extract-Transform-Load (ETL) pipeline to process it all.
It became too cumbersome to first get all of these data into a single-schema traditional SQL database, then transform it into a graph representation. So, we decided to go straight from each source into triples that are semantically captured, using standard modeling patterns.
Furthermore, we wanted to provide the bioinformatics community with a set of scripts to help anyone get started transforming these standard data sources.

A manuscript is in preparation. In the meantime, if you use any of our code or derived data, please cite this repository and the Monarch Initiative.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.