Name: library-beam
Owner: Siren
Description: Open Targets Library ETL Pipeline | Apache Beam
Created: 2018-03-14 16:48:38.0
Updated: 2018-04-06 16:38:14.0
Pushed: 2018-03-20 17:14:48.0
Homepage: null
Size: 538
Language: Python
This pipeline is designed to run with Apache Beam using the Dataflow runner. It has not been tested with other Beam backends, but it should work with them after minimal modifications. See the Apache Beam SDK documentation for more information.
The pipeline requires Python 2.
Generate a mirror of the MEDLINE FTP server in a Google Storage bucket (any other storage provider supported by the Python Beam SDK should also work), e.g. using rclone.
Configure rclone with the MEDLINE FTP server (ftp.ncbi.nlm.nih.gov) and your target GCP project:
rclone config
Generate a full mirror:
rclone sync medline-ftp:pubmed my-gcp-project-buckets:my-medline-bucket
Update new files:
rclone sync medline-ftp:pubmed/updatefiles my-gcp-project-buckets:my-medline-bucket/updatefiles
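The two sync commands above can also be driven from Python, which is convenient for cron jobs. A minimal sketch using the standard library, assuming rclone remotes named `medline-ftp` and `my-gcp-project-buckets` are already configured; the helper names are illustrative, not part of this repository:

```python
import subprocess

def build_sync_cmd(source, dest):
    """Build an `rclone sync` invocation for a source/destination pair."""
    return ["rclone", "sync", source, dest]

def mirror_medline(run=subprocess.check_call):
    # Full baseline mirror first, then the incremental update files.
    run(build_sync_cmd("medline-ftp:pubmed",
                       "my-gcp-project-buckets:my-medline-bucket"))
    run(build_sync_cmd("medline-ftp:pubmed/updatefiles",
                       "my-gcp-project-buckets:my-medline-bucket/updatefiles"))
```

Injecting `run` makes the wrapper easy to dry-run or test without touching the network.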
Install the pipeline locally:
git clone https://github.com/opentargets/library-beam
cd library-beam
pip install virtualenv
virtualenv venv
source venv/bin/activate
pip install --upgrade setuptools pip
python setup.py install
pip install https://github.com/explosion/spacy-models/releases/download/en_depent_web_md-1.2.1/en_depent_web_md-1.2.1.tar.gz
Run the NLP analytical pipeline:
python -m main \
  --project your-project \
  --job_name medline-nlp \
  --runner DataflowRunner \
  --temp_location gs://my-tmp-bucket/temp \
  --setup_file ./setup.py \
  --worker_machine_type n1-highmem-32 \
  --input_baseline gs://my-medline-bucket/baseline/pubmed18n*.xml.gz \
  --input_updates gs://my-medline-bucket/updatefiles/pubmed18n*.xml.gz \
  --output_enriched gs://my-medline-bucket-output/analyzed/pubmed18 \
  --max_num_workers 32 \
  --zone europe-west1-d
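Both Dataflow invocations above share most of their flags, so assembling the argument list programmatically avoids copy-paste drift. A sketch under the assumption that the flag names are exactly those shown in the commands; the `dataflow_args` helper itself is hypothetical, not part of this repository:

```python
def dataflow_args(project, job_name, **opts):
    """Flatten keyword options into the `--flag value` argv form used above."""
    base = {
        "project": project,
        "job_name": job_name,
        "runner": "DataflowRunner",
        "setup_file": "./setup.py",
        "temp_location": "gs://my-tmp-bucket/temp",
    }
    base.update(opts)
    args = []
    for flag, value in sorted(base.items()):
        args += ["--%s" % flag, str(value)]
    return args

# Example: the analytical step's options
argv = dataflow_args("your-project", "medline-nlp",
                     worker_machine_type="n1-highmem-32",
                     max_num_workers=32,
                     zone="europe-west1-d")
```

The resulting list can be passed straight to the Beam pipeline options parser or to `subprocess`.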
Run a job to split the enriched JSONs into smaller pieces:
python -m main \
  --project open-targets \
  --job_name open-targets-medline-process-split \
  --runner DataflowRunner \
  --temp_location gs://my-tmp-bucket/temp \
  --setup_file ./setup.py \
  --worker_machine_type n1-highmem-16 \
  --input_enriched gs://my-medline-bucket/analyzed/pubmed18*_enriched.json.gz \
  --output_splitted gs://my-medline-bucket/splitted/pubmed18 \
  --max_num_workers 32 \
  --zone europe-west1-d
NOTE: you can chain the analytical and split steps by adding the option --output_splitted gs://my-medline-bucket/splitted/pubmed18 to the analytical step.
Run a job to load the JSONs into Elasticsearch:
python load2es.py publication --es http://myesnode1:9200 --es http://myesnode2:9200
python load2es.py bioentity --es http://myesnode1:9200 --es http://myesnode2:9200
python load2es.py taggedtext --es http://myesnode1:9200 --es http://myesnode2:9200
python load2es.py concept --es http://myesnode1:9200 --es http://myesnode2:9200
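Loaders like these typically stream gzipped JSON lines into the Elasticsearch bulk API in fixed-size batches. The sketch below shows that batching pattern with the standard library only; it is an illustration of the general bulk-payload format, not the actual internals of load2es.py, and the `_type` field reflects the Elasticsearch versions current when this pipeline was written:

```python
import json

def bulk_actions(lines, index, doc_type, chunk_size=1000):
    """Yield newline-delimited bulk-API payloads built from JSON lines.

    Each document contributes two lines: an `index` action and the
    document body, as required by the Elasticsearch _bulk endpoint.
    """
    chunk = []
    for line in lines:
        doc = json.loads(line)
        chunk.append(json.dumps({"index": {"_index": index,
                                           "_type": doc_type}}))
        chunk.append(json.dumps(doc))
        if len(chunk) >= 2 * chunk_size:
            yield "\n".join(chunk) + "\n"
            chunk = []
    if chunk:
        yield "\n".join(chunk) + "\n"
```

Keeping `chunk_size` moderate (a few thousand documents) bounds request size, which matters for the long-running concept load.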
WARNING: the loading scripts currently take a long time, particularly the concept one (16 h on our system). It might be a good idea to run them under tmux so the load keeps going while you are not there looking at it, e.g. after installing tmux:
tmux new-session "python load2es.py publication --es http://myesnode1:9200 --es http://myesnode2:9200"
tmux new-session "python load2es.py bioentity --es http://myesnode1:9200 --es http://myesnode2:9200"
tmux new-session "python load2es.py taggedtext --es http://myesnode1:9200 --es http://myesnode2:9200"
tmux new-session "python load2es.py concept --es http://myesnode1:9200 --es http://myesnode2:9200"
OPTIONAL: if needed, create appropriate aliases in Elasticsearch:
curl -XPOST 'http://myesnode1:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions": [
    {"add": {"index": "pubmed-18", "alias": "!publication-data"}}
  ]
}'
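The same alias call can be issued from Python with the standard library, which is handy inside a deployment script. A sketch that only constructs the request object (it does not contact a cluster here); the `alias_request` helper is illustrative, not part of this repository:

```python
import json
try:
    from urllib.request import Request   # Python 3
except ImportError:
    from urllib2 import Request          # Python 2

def alias_request(node, index, alias):
    """Build the _aliases POST request equivalent to the curl call above."""
    body = json.dumps({"actions": [{"add": {"index": index,
                                            "alias": alias}}]})
    return Request(node + "/_aliases",
                   data=body.encode("utf-8"),
                   headers={"Content-Type": "application/json"})
```

Pass the result to `urllib.request.urlopen` (or `urllib2.urlopen` on Python 2) to execute it.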
OPTIONAL: increase Elasticsearch capacity for the adjacency matrix aggregation (used by the LINK tool):
curl -XPUT 'http://myesnode1:9200/pubmed-18-concept/_settings' -H 'Content-Type: application/json' -d'
{
  "index" : {
    "max_adjacency_matrix_filters" : 500
  }
}'