OpenBudget/budgetkey-data-pipelines

Name: budgetkey-data-pipelines

Owner: OpenBudget

Description: Budget Key data processing pipelines

Created: 2017-02-20 16:10:38.0

Updated: 2018-05-24 13:25:54.0

Pushed: 2018-05-24 13:25:52.0

Homepage: null

Size: 3895 KB

Language: Python


README

budgetkey-data-pipelines


Budget Key data processing pipelines

What are we doing here?

The heart of the BudgetKey project is its rich, up-to-date quality data collection. Data is collected from over 20 different data sources, cleaned, normalised, validated, combined and analysed - to create the most extensive repository of fiscal data in Israel.

In order to get that data, we have an extensive set of downloaders and scrapers which get the data from government publications and other websites. The fetched data is then processed and combined, and eventually saved to disk (so that people can download the raw data without hassle), loaded into a relational database (so that analysts can run in-depth queries on the data) and pushed to a key-value store (Elasticsearch) which serves our main website (obudget.org).

The framework we're using to accomplish all of this is called datapackage-pipelines. This framework allows us to write simple 'pipelines', each consisting of a set of predefined processing steps. These pipelines are not coded, but rather defined in a set of YAML files. Most of the pipelines use a set of common building blocks, plus some custom processors - mainly custom scrapers for exotic sources.
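To give a flavour of what these YAML definitions look like, here is a minimal sketch of a pipeline-spec.yaml built only from standard datapackage-pipelines processors; the pipeline name, source URL and output path are hypothetical, not taken from this repository:

example-remote-csv:
  title: Illustrative pipeline (hypothetical names and URL)
  pipeline:
    - run: add_metadata
      parameters:
        name: example-datapackage
    - run: add_resource
      parameters:
        name: example-resource
        url: https://example.org/data.csv   # hypothetical source
    - run: stream_remote_resources
    - run: dump.to_path
      parameters:
        out-path: /var/datapackages/example

Each step names a processor ('run') and its parameters; dump.to_path is the step that writes the resulting datapackage to disk, e.g. under /var/datapackages.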

Quickstart on datapackage-pipelines

The recommended way to start is by reading the README of datapackage-pipelines here - it's a bit long, so at least read the beginning and skim the rest.

Then, try to write a very simple pipeline - just to test your understanding. A good task for that would be:

As you can see, pipelines are sorted by domain and data source.

Higher up the tree, we find pipelines that aggregate different sources (e.g. companies + associations + … -> all-entities). Finally, under the budgetkey directory we have pipelines that process and store the data for displaying on the website.
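As a hedged illustration of such an aggregation pipeline, assuming the standard concatenate and dump.to_path processors (the pipeline name and field list are made up for this sketch, and loading of the upstream resources is omitted):

all-entities-example:
  pipeline:
    # (the source resources would be loaded from the upstream pipelines' output first)
    - run: concatenate
      parameters:
        target:
          name: all-entities-example
        fields:
          id: []
          name: []
          kind: []
    - run: dump.to_path
      parameters:
        out-path: /var/datapackages/all-entities-example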

Note: To understand a bit more about the difference between the different types of government spending, please read this excellent blog post.

What's currently running?

To see the current processing status of each pipeline, just hop over to the dashboard.

Developing a new pipeline
Common

TODO: Document our common processors
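In the meantime, it may help to know that a pipeline step can reference a custom processor - a Python file sitting next to the pipeline-spec.yaml - in addition to the built-in building blocks. A rough sketch with hypothetical names (registrar_scraper and companies_example are made up):

companies-registrar-example:
  pipeline:
    - run: registrar_scraper
      # hypothetical custom processor, implemented in registrar_scraper.py beside this file
    - run: dump.to_sql
      parameters:
        tables:
          companies_example:
            resource-name: companies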

Quickstart
Installation of the Package
$ sudo apt-get install build-essential python3-dev libxml2-dev libxslt1-dev libleveldb-dev
$ python --version
Python 3.6.0+
$ sudo mkdir -p /var/datapackages && sudo chown $USER /var/datapackages/
$ make install
$ budgetkey-dpp
INFO    :Main                            :Skipping redis connection, host:None, port:6379
Available Pipelines:
- ./budget/national/changes/original/national-budget-changes

Installing Python 3.6+

We recommend using pyenv for managing your installed python versions.

On Ubuntu, use these commands:

sudo apt-get install git python-pip make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev
sudo pip install virtualenvwrapper

git clone https://github.com/yyuu/pyenv.git ~/.pyenv
git clone https://github.com/yyuu/pyenv-virtualenvwrapper.git ~/.pyenv/plugins/pyenv-virtualenvwrapper

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
echo 'pyenv virtualenvwrapper' >> ~/.bashrc

exec $SHELL

On OSX, you can run

brew install pyenv
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile

After installation, running:

pyenv install 3.6.1
pyenv global 3.6.1

Will set your Python version to 3.6.1

Running a Pipeline
$ budgetkey-dpp run ./entities/companies/registrar/registry

The following files will be created:

Writing Tests
unit tests
$ make test
run a specific test / modify test arguments

any arguments added to tox will be added to the underlying py.test command

$ tox tests/tenders/test_fixtures.py

tox can be a bit slow, especially when doing TDD

to run tests faster you can run py.test directly, but you will need to set up the test environment first

$ pip install pytest
$ py.test tests/tenders/test_fixtures.py -svk test_tenders_fixtures_publishers
Using Docker Compose

Docker Compose can be used to run a full environment with all required services - similar to the production environment.

Installation
Loading datapackages to Elasticsearch

This method allows you to load the prepared datapackages into Elasticsearch; the data is then available for exploration via Kibana.

This snippet will delete all local docker-compose volumes - so make sure you don't have anything important there beforehand.

It loads the first 100 rows from each pipeline; you can modify ES_LIMIT_ROWS below, or remove it to load all the data.

docker-compose down -v && docker-compose pull elasticsearch db && docker-compose up -d elasticsearch db
export DPP_DB_ENGINE="postgresql://postgres:123456@localhost:15432/postgres"
export DPP_ELASTICSEARCH="localhost:19200"
for doctype in `budgetkey-dpp | grep .budgetkey/elasticsearch/index_ | cut -d"_" -f2 - | cut -d" " -f1 -`; do
    echo " > Loading ${doctype}"
    ES_LOAD_FROM_URL=1 ES_LIMIT_ROWS=100 budgetkey-dpp run ./budgetkey/elasticsearch/index_$doctype
done
Now you can start Kibana to explore the data

docker-compose up -d kibana

Kibana should be available at http://localhost:15601/ (It might take some time to start up properly)

Index name is budgetkey

