hasadna/open-pension-net-scraper

Name: open-pension-net-scraper

Owner: The Public Knowledge Workshop

Description: Scrape net websites related to open pension

Created: 2016-11-14 18:34:04.0

Updated: 2017-09-07 08:37:06.0

Pushed: 2017-09-07 10:52:54.0

Homepage: null

Size: 26

Language: Python


README

Open Pension Scraper

This repo contains tools and [meta]data for the purpose of extracting publicly-available online information to be used by Open Pension.

Open Pension is a “Hasadna” project, aiming to reveal the secrets behind the Israeli pension market.

Prerequisites
Installation

Note: if you source venv/bin/activate in a shell, you can skip the ./envrun.sh in commands here (still, it's handy if you open a shell only to run rq worker).

If you want rq-dashboard (for monitoring batch jobs via browser):

Running
Dump portfolio of a single month

[this is something you don't need redis for]

For example: ./envrun.sh python -m web-sources.gemelnet 101 2016 1 would write the Jan 2016 portfolio of kupa 101 to data/gemelnet-monthly-portfolios/101-2016-01.csv.

Pensianet allows dumping portfolios of all kranot in a single request: ./envrun.sh python -m web-sources.pensianet 2016 1 would write data/pensianet/2016-01.csv containing portfolios of all kranot for Jan 2016.
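As a rough illustration of the naming scheme used in the two examples above, the sketch below builds the output paths from the command-line values. The function names are hypothetical and not taken from this repo; only the directory and file name patterns come from the examples.

```python
from pathlib import Path


def gemelnet_portfolio_path(kupa_id: int, year: int, month: int) -> Path:
    """Path of a single kupa's monthly portfolio (hypothetical helper)."""
    # The month is zero-padded to two digits, as in 101-2016-01.csv.
    return Path("data/gemelnet-monthly-portfolios") / f"{kupa_id}-{year}-{month:02d}.csv"


def pensianet_portfolio_path(year: int, month: int) -> Path:
    """Path of the single file holding all kranot for a month (hypothetical helper)."""
    return Path("data/pensianet") / f"{year}-{month:02d}.csv"


print(gemelnet_portfolio_path(101, 2016, 1).as_posix())
print(pensianet_portfolio_path(2016, 1).as_posix())
```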

Batch dump reports over a period

Run these in separate shells:

Dumping portfolios

For example: ./envrun.sh python batch_gemelnet.py 101 1999 8 2002 4 would queue jobs that dump portfolios of kupa 101 for all months between Aug 1999 and April 2002 into data/gemelnet-monthly-portfolios/101-1999-08.csv through data/gemelnet-monthly-portfolios/101-2002-04.csv.
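To make the size of such a batch concrete, here is a minimal sketch that enumerates the months in an inclusive range. It is not the code from batch_gemelnet.py; the function name months_between is an assumption made for illustration.

```python
from typing import Iterator, Tuple


def months_between(start_year: int, start_month: int,
                   end_year: int, end_month: int) -> Iterator[Tuple[int, int]]:
    """Yield (year, month) pairs from the start month to the end month, inclusive."""
    # Map each month to a single running index so year rollover is handled by arithmetic.
    index = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    while index <= last:
        year, month0 = divmod(index, 12)
        yield year, month0 + 1
        index += 1


months = list(months_between(1999, 8, 2002, 4))
print(len(months), months[0], months[-1])  # 33 (1999, 8) (2002, 4)
```

Under this reading, the example command would queue one portfolio job per month, 33 in total.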

Dumping performance reports

For example: ./envrun.sh python batch_gemelnet.py 101 1999 8 2002 4 --type p (or ./envrun.sh python batch_gemelnet.py 101 1999 8 2002 4 -t p) would queue a single job to fetch a performance report for kupa 101 between Aug 1999 and April 2002 and dump it into data/gemelnet/perf-101-1999-08-2002-04.csv.
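The sketch below shows one way a command line of this shape could be parsed and mapped to the performance report file name. It is an assumption-laden illustration, not the actual batch_gemelnet.py: the argument names, the default of None for the report type, and the helper perf_report_path are all hypothetical. Only the positional order, the --type / -t option with value p, and the output file pattern come from the example above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented argument order (names are assumptions)."""
    parser = argparse.ArgumentParser(description="Illustrative sketch only")
    parser.add_argument("kupa", type=int)
    parser.add_argument("start_year", type=int)
    parser.add_argument("start_month", type=int)
    parser.add_argument("end_year", type=int)
    parser.add_argument("end_month", type=int)
    # "p" requests a performance report; the None default is an assumption.
    parser.add_argument("-t", "--type", dest="report_type", default=None)
    return parser


def perf_report_path(args: argparse.Namespace) -> str:
    """File name pattern taken from the example: perf-<kupa>-<start>-<end>.csv."""
    return (f"data/gemelnet/perf-{args.kupa}-{args.start_year}-{args.start_month:02d}"
            f"-{args.end_year}-{args.end_month:02d}.csv")


args = build_parser().parse_args(["101", "1999", "8", "2002", "4", "-t", "p"])
print(args.report_type, perf_report_path(args))
```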

There's no batch_pensianet.py [yet?], but you can get performance for a 12 month period by using (for example) ./envrun.sh python -m web_sources.pensianet 2016 10 --type p (or -t p). This would dump 11/2015-10/2016 performance of all kranot to data/pensianet/perf-2015-11-2016-10.csv. [This doesn't require redis].
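The 12 month period ends at the month given on the command line, so it starts 11 months earlier. A small sketch of that arithmetic follows; the function names are hypothetical and the real module may compute this differently.

```python
from typing import Tuple


def twelve_month_window(end_year: int, end_month: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((start_year, start_month), (end_year, end_month)) for a 12 month window."""
    # Step back 11 months from the end month using a running month index.
    start_index = end_year * 12 + (end_month - 1) - 11
    start_year, start_month0 = divmod(start_index, 12)
    return (start_year, start_month0 + 1), (end_year, end_month)


def pensianet_perf_path(end_year: int, end_month: int) -> str:
    """File name pattern taken from the example above (hypothetical helper)."""
    (sy, sm), (ey, em) = twelve_month_window(end_year, end_month)
    return f"data/pensianet/perf-{sy}-{sm:02d}-{ey}-{em:02d}.csv"


print(twelve_month_window(2016, 10))  # ((2015, 11), (2016, 10))
print(pensianet_perf_path(2016, 10))  # data/pensianet/perf-2015-11-2016-10.csv
```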

Monthly incremental dumping

./dump-latest.sh [N] queues dumps of portfolios for all kupot N months ago, and performance reports for all kupot for the year between N+11 and N months ago.

It also dumps portfolios and performance of all pensianet kranot for the same periods, but these appear as single csv files (as opposed to a file per kupa in gemelnet).

Default for N is 2 (data for last month isn't available early in the month, while data for 2 months ago is always available).
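For a concrete reading of "N months ago" and "between N+11 and N months ago", the sketch below computes both periods for a given date. It only illustrates the date arithmetic; dump-latest.sh itself is a shell script, and the names months_ago and dump_periods are assumptions.

```python
from datetime import date
from typing import Dict, Tuple


def months_ago(today: date, n: int) -> Tuple[int, int]:
    """Return the (year, month) that lies n calendar months before today's month."""
    index = today.year * 12 + (today.month - 1) - n
    year, month0 = divmod(index, 12)
    return year, month0 + 1


def dump_periods(today: date, n: int = 2) -> Dict[str, Tuple[int, int]]:
    """Portfolio month and performance window for a run with the given N."""
    return {
        "portfolio_month": months_ago(today, n),
        "performance_start": months_ago(today, n + 11),
        "performance_end": months_ago(today, n),
    }


print(dump_periods(date(2017, 9, 7)))
```

With the default N of 2, a run on 7 Sep 2017 would target the July 2017 portfolios and the performance window Aug 2016 to July 2017.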

Monitoring/controlling job execution

You can suspend job execution with ./envrun.sh rq suspend and resume work with ./envrun.sh rq resume

Utilities

[Don't require redis]

Generate CSV with totals of all Gemelnet portfolios for a given month

For example: ./envrun.sh gemelnet_totals.py 2016 9 would generate data/gemelnet/totals-2016-09.csv with “bottom lines” of all {kupa id}-2016-09.csv portfolios.

Batch lookup stock names at quotenet

For example: ./envrun.sh python batch_quotenet_stock_names.py < sample-data/quotenet-queries.txt > something.csv should generate something similar to sample-data/quotenet-output.csv.

Tests

Only pep8 so far ;)

Contribute

Just fork and do a pull request (;

