Name: knesset-data-pipelines
Owner: The Public Knowledge Workshop
Description: knesset data scrapers and data sync - using the datapackage pipelines framework
Created: 2017-07-24 11:31:06.0
Updated: 2018-05-01 13:53:23.0
Pushed: 2018-05-01 13:53:21.0
Size: 1003
Language: Python
Knesset data scrapers and data sync
Uses the datapackage pipelines framework to scrape Knesset data and produce JSON+CSV files for useful queries.
This flow is executed periodically, and the resulting files are copied to Google Cloud Storage for use by the static web site generator and (in the future) the oknesset APIs.
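For example, once a flow has published its output, the datapackage descriptor and CSV files can be fetched over plain HTTP. The bucket path below is hypothetical; check where your deployment actually publishes its files:
# the bucket path here is hypothetical - adjust to your deployment's storage location
curl -s https://storage.googleapis.com/knesset-data-pipelines/data/committees/kns_committee/datapackage.json \
    | python3 -m json.tool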
Looking to contribute? Check out the Help Wanted Issues or the Noob Friendly Issues for some ideas.
Useful resources for getting acquainted:
Most pipelines can be run locally with minimal infrastructure dependencies.
Install some dependencies (the following works for the latest version of Ubuntu):
sudo apt-get install -y python3.6 python3-pip python3.6-dev libleveldb-dev libleveldb1v5
pip3 install pipenv
Install the pipeline dependencies:
pipenv install
Activate the virtualenv:
pipenv shell
Install the python module:
pip install -e .
List the available pipelines:
dpp
Run a pipeline:
dpp run <PIPELINE_ID>
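For example, using the committees pipeline that appears later in this README:
# run the committees dataservice scraper pipeline
dpp run ./committees/kns_committee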
The Knesset API is sometimes blocked / throttled from certain IPs.
To overcome this, we make the core data available for download, so pipelines that process the data don't need to call the Knesset API directly.
You can set DATASERVICE_LOAD_FROM_URL=1 to enable this download for pipelines that support it:
DATASERVICE_LOAD_FROM_URL=1 pipenv run dpp run ./committees/kns_committee
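When the pipeline finishes you can inspect the resulting datapackage. The data/ output path below is an assumption; check the pipeline's pipeline-spec.yaml for the actual dump location:
# the output location is an assumption - see the pipeline spec for the real path
ls data/committees/kns_committee/
python3 -m json.tool data/committees/kns_committee/datapackage.json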
The dump to db pipelines are used to populate the obudget redash at http://data.obudget.org/
To test locally, start a postgresql server:
docker run -d --rm --name postgresql -p 5432:5432 -e POSTGRES_PASSWORD=123456 postgres
Run the all package with dump.to_sql enabled:
DPP_DB_ENGINE=postgresql://postgres:123456@localhost:5432/postgres pipenv run dpp run ./committees/all
Run the dump to db pipeline:
DPP_DB_ENGINE=postgresql://postgres:123456@localhost:5432/postgres dpp run ./knesset/dump_to_db
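To verify the dump without adminer you can query postgres directly from the running container. The kns_committee table name is an assumption; list the tables first to see what was actually created:
# list the tables created by the dump
docker exec -it postgresql psql -U postgres -c '\dt'
# example count query - kns_committee is an assumed table name
docker exec -it postgresql psql -U postgres -c 'select count(*) from kns_committee;'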
Start adminer to browse the data:
docker run -d --name adminer -p 8080:8080 --link postgresql adminer
Remove the containers when done:
docker rm --force adminer postgresql
To run the pipelines using the published Docker image:
docker pull orihoch/knesset-data-pipelines
docker run -it --entrypoint bash -v `pwd`:/pipelines orihoch/knesset-data-pipelines
Continue with the Running the pipelines locally section above.
You can usually fix permission problems on the files by running chown -R 1000:1000 . inside the docker container.
If you have access to the required secrets and the Google Cloud account, you can use the following commands to run with all required dependencies:
docker run -d --rm --name postgresql -p 5432:5432 -e POSTGRES_PASSWORD=123456 postgres
docker run -d --rm --name influxdb -p 8086:8086 influxdb
# --link influxdb is included so the DPP_INFLUXDB_URL hostname resolves
docker build -t knesset-data-pipelines . &&\
docker run -it -e DUMP_TO_STORAGE=1 -e DUMP_TO_SQL=1 \
    -e DPP_DB_ENGINE=postgresql://postgres:123456@postgresql:5432/postgres \
    -e DPP_INFLUXDB_URL=http://influxdb:8086 \
    -v /path/to/google/secret/key:/secret_service_key \
    --link postgresql --link influxdb \
    knesset-data-pipelines
Run grafana to visualize metrics:
docker run -d --rm --name grafana -p 3000:3000 --link influxdb grafana/grafana
Import dataservice_collection_grafana_dashboard.json into Grafana to get the dataservice collection dashboard.
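If you prefer the command line over the Grafana UI, a sketch along these lines can push the dashboard via Grafana's HTTP API, assuming the default admin:admin credentials and that the file contains a plain dashboard definition:
# a sketch - assumes default admin:admin credentials and a plain dashboard
# definition that needs wrapping for the /api/dashboards/db endpoint
curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \
    -H 'Content-Type: application/json' \
    -d "{\"dashboard\": $(cat dataservice_collection_grafana_dashboard.json), \"overwrite\": true}"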
You can build the image with Google Cloud Build; this is similar to what the continuous deployment does.
Replace UNIQUE_TAG_NAME with a unique id for the image, e.g. the name of the branch you are testing:
IMAGE_TAG="gcr.io/hasadna-oknesset/knesset-data-pipelines:UNIQUE_TAG_NAME"
CLOUDSDK_CORE_PROJECT=hasadna-oknesset
PROJECT_NAME=knesset-data-pipelines
gcloud --project ${CLOUDSDK_CORE_PROJECT} container builds submit \
    --substitutions _IMAGE_TAG=${IMAGE_TAG},_CLOUDSDK_CORE_PROJECT=${CLOUDSDK_CORE_PROJECT},_PROJECT_NAME=${PROJECT_NAME} \
    --config continuous_deployment_cloudbuild.yaml .
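Once the build succeeds you can verify that the tag was pushed (a sketch; on newer gcloud versions the builds commands live under gcloud builds instead of gcloud container builds):
# list the pushed tags for the image
gcloud --project ${CLOUDSDK_CORE_PROJECT} container images list-tags \
    gcr.io/hasadna-oknesset/knesset-data-pipelines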