dssg/cincinnati

Name: cincinnati

Owner: Data Science for Social Good

Description: DSaPP project with the City of Cincinnati. Building upon the DSSG15 project

Created: 2015-12-04 18:47:00.0

Updated: 2016-10-23 21:33:55.0

Pushed: 2017-06-23 04:53:26.0

Homepage:

Size: 2905

Language: Python


README

Cincinnati Blight

This is the continuation of the Cincinnati summer project done during DSSG 2015.

About

First settled in 1788, Cincinnati is one of the oldest American cities west of the original colonies. Today, the city struggles with an aging housing stock, which stifles economic redevelopment in some neighborhoods.

DSSG is working with the City of Cincinnati to identify properties at risk of code violations or abandonment. We hope that early intervention strategies can prevent further damage and stimulate neighborhood revitalization. Read more about our project here.

Setup
Clone the repo

Clone the repo into $ROOT_FOLDER:

git clone https://github.com/dssg/cincinnati $ROOT_FOLDER
Select folders for code, data, and output

The code relies on four bash environment variables (ROOT_FOLDER, DATA_FOLDER, OUTPUT_FOLDER, and PYTHONPATH), which define where this repo, your raw data, and your outputs live. There is an example file, env_sample.sh, which looks like this:

# where to store the code
export ROOT_FOLDER="/path/to/repo/"
# where data is stored
export DATA_FOLDER="/path/to/data/"
# where to output results from models
export OUTPUT_FOLDER="/path/to/output/"

# add lib folder to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$ROOT_FOLDER/lib_cinci

Modify the three folder variables as appropriate. The PYTHONPATH line is also necessary, since lib_cinci contains many functions used across the project. Consider adding these exports to your shell profile so they get loaded automatically, or source the file before running the pipeline.
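As a minimal, purely illustrative sketch (not code from this repo), Python code run inside the pipeline can then pick up these locations from the environment:

```python
# Minimal sketch: resolve the folders defined in env_sample.sh.
# The variable names come from that file; everything else here is illustrative.
import os

ROOT_FOLDER = os.environ["ROOT_FOLDER"]      # where this repo lives
DATA_FOLDER = os.environ["DATA_FOLDER"]      # where the raw data lives
OUTPUT_FOLDER = os.environ["OUTPUT_FOLDER"]  # where model outputs are written
```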

Provide config.yaml, logger_config.yaml and .pgpass

The code loads some parameters from a config.yaml file stored in $ROOT_FOLDER. This file lists your connection parameters for a Postgres DB and a Mongo DB, which are used throughout the pipeline. Use the config_sample.yaml file to see the structure, then rename it to config.yaml. Make sure that the file is stored in your $ROOT_FOLDER.
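As a hedged sketch of how such a file can be consumed (the key names db and mongo_uri below are placeholders, not the repo's actual schema; config_sample.yaml is the authoritative reference):

```python
# Illustrative only: load config.yaml from $ROOT_FOLDER with PyYAML.
import os
import yaml

with open(os.path.join(os.environ["ROOT_FOLDER"], "config.yaml")) as f:
    config = yaml.safe_load(f)

postgres_params = config.get("db")    # placeholder key: Postgres host, port, user, etc.
mongo_uri = config.get("mongo_uri")   # placeholder key: Mongo connection string
```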

logger_config.yaml configures logging for the project's Python code. Customize it as you please; it is git-ignored.
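For reference, a common way to apply a YAML logging configuration in Python looks like the sketch below; the repo's own code may load logger_config.yaml differently.

```python
# Hedged sketch: apply a dictConfig-style YAML logging configuration.
import logging
import logging.config

import yaml

with open("logger_config.yaml") as f:
    logging.config.dictConfig(yaml.safe_load(f))

logging.getLogger(__name__).info("logger configured")
```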

For parts of the ETL, you will also need a .pgpass file (note the dot). If you are using Docker, this file needs to be saved as $ROOT_FOLDER/.pgpass so it is available when building the image. If you are not going to use Docker, just make sure that a standard .pgpass file is in your home folder. See .pgpass_sample for syntax details. This file gives the connection parameters for your Postgres DB.
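For orientation, a standard .pgpass entry has the form hostname:port:database:username:password (one line per server, with the file readable only by you). The sketch below is illustrative only; the host, database, and user are placeholders. libpq-based clients such as psycopg2 then resolve the password from .pgpass automatically:

```python
# Illustrative sketch: connect without a password in code; libpq reads it from .pgpass.
import psycopg2

conn = psycopg2.connect(host="your-db-host", port=5432,
                        dbname="your-database", user="your-user")
conn.close()
```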

Resources

This project relies on a data dump from the City of Cincinnati. Some of the data is publicly available, and pulled from the city's open data API. Some data is private, and was delivered by the City of Cincinnati. More details on the data layout can be found in the pre-modeling folder.

The pipeline makes use of a Postgres DB for storing the raw data and generated features. Some of the feature generation steps (especially aggregations over spatial features) are computationally expensive (and not optimized), and might take a medium-sized Postgres server several days to complete. The pipeline also requires a (small) Mongo DB, which is used as a logger for model outputs. Here, we used MLab for convenience.
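To make the Mongo role concrete, here is a hedged sketch (not the repo's actual logging code; the URI, database, collection, and document fields are placeholders) of using Mongo as a simple logger for model outputs via pymongo:

```python
# Hedged sketch: log one model run's configuration and score to Mongo.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@host:27017/cincinnati")  # placeholder URI
runs = client["cincinnati"]["model_runs"]                              # placeholder db/collection
runs.insert_one({
    "model": "RandomForestClassifier",
    "parameters": {"n_estimators": 500, "max_depth": 10},
    "metric": {"name": "precision_at_top_k", "value": 0.0},  # placeholder values
    "timestamp": datetime.utcnow(),
})
```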

The pipeline conducts a naive grid search over several hyperparameters, replicated across several temporal splits for temporal cross-validation. The model fitting happens in Python (using scikit-learn). We ran the model fitting on several large AWS machines, with the work broken up by temporal ranges.
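The sketch below illustrates the general idea (it is not the repo's model.py; the column names and cutoff dates are placeholders): a naive grid search replicated across temporal splits, training on everything before a cutoff and evaluating on the following year.

```python
# Illustrative sketch of temporal cross-validation with a naive grid search.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import ParameterGrid


def temporal_grid_search(features, feature_cols, label_col, date_col,
                         train_end_dates, param_grid):
    """Train on rows before each cutoff date, evaluate on the following year."""
    results = []
    for train_end in pd.to_datetime(train_end_dates):
        train = features[features[date_col] < train_end]
        test = features[(features[date_col] >= train_end) &
                        (features[date_col] < train_end + pd.DateOffset(years=1))]
        for params in ParameterGrid(param_grid):
            model = RandomForestClassifier(**params)
            model.fit(train[feature_cols], train[label_col])
            preds = model.predict(test[feature_cols])
            results.append({"train_end": train_end, **params,
                            "precision": precision_score(test[label_col], preds)})
    return pd.DataFrame(results)
```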

Overall Data Pipeline

Once you have set up your environment, you can start using the pipeline. The general procedure is as follows (specific instructions for each step are available inside each subfolder):

  1. Load data into the database
  2. Use the pre-modeling folder to upload all the data to the database
  3. Perform geocoding on some datasets. Use the bulk_geocoder for this.
  4. Generate features from the data
  5. Run some experiments. Use model.py inside the model folder to train models. model.py requires you to provide a configuration file; see default.yaml in that folder for reference. The experiments folder contains more examples.
  6. Evaluate model performance and generate lists for field tests using the post-modeling directory.

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.