Name: dipper
Owner: Monarch Initiative
Description: Data Ingestion Pipeline for SciGraph
Created: 2014-10-25 16:49:58.0
Updated: 2017-11-28 19:45:53.0
Pushed: 2018-01-04 22:53:03.0
Homepage: null
Size: 80350
Language: Python
Dipper is a pure Python package to generate RDF triples from common scientific resources. Dipper includes subpackages and modules to create graphical models of this data, including:
Models package for generating common sets of triples, including common OWL axioms, complex genotypes, associations, evidence and provenance models.
Graph package for building graphs with RDFLib or streaming n-triples
Source package containing fetchers and parsers that interface with remote databases and web services
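To make the "streaming n-triples" idea concrete, here is a minimal standard-library sketch. It does not use Dipper's own Graph package; `StreamedGraph`, `add_triple`, and `to_ntriple` are hypothetical names chosen for illustration, and the example IRIs are placeholders. The point is only that a streaming graph writes each triple out as soon as it is produced rather than holding everything in memory.

```python
import io


def to_ntriple(subject, predicate, obj, literal=False):
    """Format one triple as an N-Triples line; IRIs go in angle brackets."""
    if literal:
        # Escape backslashes first, then quotes and line breaks.
        escaped = (
            obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} .\n"


class StreamedGraph:
    """Hypothetical graph that writes each triple immediately instead of storing it."""

    def __init__(self, out):
        self.out = out  # any writable text handle (file, StringIO, stdout)

    def add_triple(self, subject, predicate, obj, literal=False):
        self.out.write(to_ntriple(subject, predicate, obj, literal))


# Usage: stream one labelled resource into an in-memory buffer.
buffer = io.StringIO()
graph = StreamedGraph(buffer)
graph.add_triple(
    "http://example.org/gene/1",  # placeholder IRI
    "http://www.w3.org/2000/01/rdf-schema#label",
    "abc",
    literal=True,
)
```

An in-memory graph (for example one built with RDFLib) makes it easy to query or re-serialize the data, while a streamed writer keeps memory use flat for very large sources.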
The dipper main wraps all of the source parsers, enabling users to specify one or more sources to process.
The general strategy is that there is one class per data source. We define the files to be fetched,
any file scrubbing, and then the parsing methods. As the files are parsed, triples are loaded into an in-memory graph.
This graph is then typically dumped into triples in turtle format. For testing purposes,
a subset of the graph is also dumped to *_test.ttl.
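The one-class-per-source strategy described above can be sketched roughly as follows. This is an illustrative outline only, not Dipper's actual base class: `ExampleSource`, its attributes, the placeholder URL, and the simplified Turtle-style output (prefix declarations omitted) are all assumptions made for the example.

```python
import csv
import io


class ExampleSource:
    """Hypothetical source class: declares its files, then parses them into triples."""

    # Files this source would fetch; the URL is a placeholder, not a real endpoint.
    files = {"genes": {"file": "genes.tsv", "url": "https://example.org/genes.tsv"}}

    def __init__(self):
        self.graph = set()          # in-memory graph: a set of (s, p, o) tuples
        self.testgraph = set()      # subset of the graph kept for testing
        self.test_ids = {"GENE:2"}  # identifiers that belong in the test subset

    def parse(self, handle):
        """Read tab-separated (id, label) rows and add one triple per row."""
        for gene_id, label in csv.reader(handle, delimiter="\t"):
            triple = (gene_id, "rdfs:label", label)
            self.graph.add(triple)
            if gene_id in self.test_ids:
                self.testgraph.add(triple)

    @staticmethod
    def serialize(graph):
        """Render triples as simple Turtle-style lines (prefix declarations omitted)."""
        return "\n".join(f'{s} {p} "{o}" .' for s, p, o in sorted(graph))


# Usage: parse two rows, then dump the full graph and the test subset.
source = ExampleSource()
source.parse(io.StringIO("GENE:1\tabc\nGENE:2\txyz\n"))
full_ttl = source.serialize(source.graph)      # would be written to <source>.ttl
test_ttl = source.serialize(source.testgraph)  # would be written to <source>_test.ttl
```

Keeping the test subset as a second graph filled during the same parsing pass means the small `*_test.ttl` file always reflects the same modeling logic as the full output.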
Data generated from this pipeline can be used in a variety of ways downstream. We recommend loading the data into a graph database that is optimized for use with ontologies, such as SciGraph. Smaller .ttl files can be loaded into an ontology editor like Protege.
Python 3 or higher (and therefore pip3 if using pip)
One of the unit tests requires owltools to be available on your path. You could modify the code to skip this test, if necessary.
Running make test requires nosetests (on OS X you may need to run sudo pip3 install nose).
Required external python packages:
Optional source specific python packages:
Note: Dipper imports source modules dynamically at runtime. As a result, it is possible to build a core set of requirements and add source-specific dependencies as needed. Presently this is only implemented with pip requirements files. For example, to build dependencies for MGI:
pip3 install -r requirements.txt
pip3 install -r requirements/mgi.txt
To install dependencies for all sources:
pip3 install -r requirements.txt
pip3 install -r requirements/all-sources.txt
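The reason per-source requirements files work is that a module's dependencies are only needed once that module is actually imported. A minimal sketch of runtime importing with the standard library's `importlib` is shown below. It is not Dipper's loader: `load_source_class` and `SOURCE_MAP` are hypothetical names, and standard-library modules stand in for source parser modules so the example runs anywhere.

```python
import importlib


def load_source_class(module_name, class_name):
    """Import a module by name at runtime and return one class from it."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in mapping from a source key to (module path, class name).
# In a real pipeline these would name source parser modules and classes.
SOURCE_MAP = {
    "json": ("json", "JSONDecoder"),
    "csv": ("csv", "DictReader"),
}

# Only the requested sources are imported, so only their dependencies
# need to be installed.
requested = ["json"]
loaded = {name: load_source_class(*SOURCE_MAP[name]) for name in requested}
```

With this pattern, a missing optional package only causes an error when the source that needs it is requested.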
If you encounter any errors installing these packages using Homebrew, it could be due to a current known issue in upgrading to pip3. In this case, first force-reinstall pip2 (pip2 install --upgrade --force-reinstall pip) and then install the package using pip3 (e.g. pip3 install psycopg2).
You can run the code by supplying a list of one or more sources on the command line. Some examples:
Furthermore, you can check things out by supplying a limit, which processes only the first N rows or data elements.
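As a rough picture of what a row limit does inside a parser, the sketch below stops reading after the first N rows. It is an assumption-laden illustration rather than Dipper's implementation: `parse_rows` and its `limit` parameter are hypothetical, and the input is a small tab-separated string.

```python
import csv
import io
from itertools import islice


def parse_rows(handle, limit=None):
    """Yield parsed rows, stopping after `limit` rows when a limit is given."""
    reader = csv.reader(handle, delimiter="\t")
    # islice(reader, None) yields every row, so limit=None means "no limit".
    for row in islice(reader, limit):
        yield row


data = "a\t1\nb\t2\nc\t3\n"
first_two = list(parse_rows(io.StringIO(data), limit=2))
all_rows = list(parse_rows(io.StringIO(data)))
```

Because the limit is applied while reading, a quick trial run never has to process the whole file.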
You can also run the stand-alone tests in `tests/test_*` to generate subsets of the data and run unit tests.
Other command-line parameters are explained if you request help:
You can also write your own dipper packages outside of this project, using the framework we've set up here. Simply import Dipper as a python package, write your own wrapper, and add your own source parsers.
as an external python package with pip3
or clone the repository and run:
The following sources have been mapped:
Each source has a corresponding script at https://github.com/monarch-initiative/dipper/tree/master/dipper/sources
,zfin,omim,biogrid,mgi,impc,panther,ncbigene,ucscbands,
genereviews,eom,coriell,clinvar,monochrom,kegg,animalqtldb,
mbl,hgnc,orphanet,omia,flybase,mmrrc,wormbase,mpd,gwascatalog,go
Each source also has a corresponding concept map diagram that documents modeling patterns implemented in SciGraph, via Dipper-mediated transformation into Monarch's common target model. These are stored in the ingest-artifacts repo at https://github.com/monarch-initiative/ingest-artifacts/tree/master/sources.
Don't see a parser you want? Feel free to request a new one, or you could contribute a Source parser to our suite!
Please see our best-practices documentation for details on writing new Source parsers using Dipper code, and make a pull request.
Our identifier documentation, as referenced in our recent paper on identifiers (doi:10.1371/journal.pbio.2001414, https://doi.org/10.1371/journal.pbio.2001414), has been moved to https://github.com/monarch-initiative/monarch-app/blob/master/README.md#identifiers
The DIPper data pipeline was born out of the need for a uniform representation of human and model organism
genotype-to-phenotype data, and an easy Extract-Transform-Load (ETL) pipeline to process it all.
It became too cumbersome to first get all of these data into a single-schema traditional SQL database,
then transform it into a graph representation. So, we decided to go straight from each source into triples that
are semantically captured, using standard modeling patterns.
Furthermore, we wanted to provide the bioinformatics community with a set of scripts to help anyone
get started transforming these standard data sources.
A manuscript is in preparation. In the meantime, if you use any of our code or derived data, please cite this repository and the Monarch Initiative.