newsdev/compstat_parser

Name: compstat_parser

Owner: NYT Newsroom Developers

Description: Parse the NYPD's weekly per-precinct crime complaints stats to CSV or MySQL

Created: 2015-02-06 21:35:04

Updated: 2018-04-21 19:01:41

Pushed: 2017-04-07 20:33:31

Homepage: http://open.blogs.nytimes.com/2015/04/03/purifying-the-sea-of-pdf-data-automatically/

Size: 41

Language: Ruby


README

CompStat Parser

New York Police Department Complaint Statistics Scraper & Parser

This collection of tools scrapes the N.Y.P.D.'s CompStat site, downloads the crime stats that are published as PDFs each week for each precinct, then parses them into actual data – CSVs.

At minimum, this tool outputs a CSV of current-week crime data, but more advanced options are available too.

After you install this tool (see below), you will be able to download the most recent PDFs from the NYPD's site and generate a CSV.

Installation

Run the commands following a $ on the command line (like Terminal on a Mac). This assumes you have a Ruby version manager (like RVM or rbenv) and MySQL already installed on your machine.

$ git clone git@github.com:nytinteractive/compstat_parser.git
$ cd compstat_parser
$ rbenv install jruby-1.7.16 # or another recent JRuby version
$ rbenv local jruby-1.7.16
$ mysql -e "create database compstat" # creates a database in MySQL called "compstat"
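You can confirm that rbenv picked up the local JRuby version before continuing:

$ ruby -v # should report a JRuby 1.7.16 version string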

Optionally, fill in config.yml (based on the details in config.example.yml) if you want data saved to a database or PDFs saved to S3. (See the “Configuration options” section below for more information.)

$ bundle install
$ bundle exec jruby bin/compstat_scraper.rb # once the scraper is installed, execute it
Usage

Note that if you run the script multiple times without a database, rows will be duplicated in the CSV. You should dedupe it, e.g. with the UNIX uniq tool, in Sublime Text, or in Excel.
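For instance, this one-liner keeps only the first occurrence of each row (header included) while preserving order, assuming the default output file name:

$ awk '!seen[$0]++' crime_stats.csv > crime_stats_deduped.csv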

Advanced Options

This tool can also seamlessly run weekly via cron and interface with Amazon S3 for storage of PDFs and MySQL (or RDS) for stats. It can send you emails via Amazon's Simple Notification Service (SNS) if crime stats aren't posted when you expect them to be. These options are set in a config file, config.yml.

Depending on whether you're trying to parse locally-stored old PDFs or to scrape and parse the N.Y.P.D.'s most current ones, this library supplies two additional executables (in src/bin/).

A fourth executable, status_checker.rb, checks whether data is arriving weekly as expected; if not, it sends you emails (with emoji!) telling you to investigate.

Configuration options

See config.example.yml for a working example, or:

aws:
  access_key_id: whatever
  secret_access_key: whatever
  s3:
    bucket: mybucket
    bucket_path: moving_summonses
  sns:
    topic_arn: arn:aws:sns:region:1234567890:topic-name
mysql:
  host: localhost
  username: root
  password:
  port:
  database:
local_pdfs_path: false # false means don't store PDFs locally, otherwise a path to a folder to store them
csv: 'crime_stats.csv'

When any of these options are unspecified, they will be silently ignored. (However, if the settings are invalid, an error will be thrown.) For instance, if the mysql block isn't supplied, data will not be sent to MySQL; if AWS is unspecified, PDFs will not be uploaded to S3 and status_checker.rb will not send notifications by email. An exception is the csv key: if this is unset, data will be saved to crime_stats.csv; set it to “false” or 0 to prevent any CSV from being generated.

If MySQL is specified in the config file, two tables will be created (or appended to, if they already exist) in the specified database: crimes_citywide and crimes_by_precinct. The record layout for each table is identical: citywide summaries are located in crimes_citywide and precinct-by-precinct data is in crimes_by_precinct.
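Once a run has completed, you can sanity-check both tables from the shell (the database name here is the one created in the installation steps above):

$ mysql compstat -e "SELECT COUNT(*) FROM crimes_citywide; SELECT COUNT(*) FROM crimes_by_precinct;"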

All of these options can also be specified as environment variables by flattening their paths, as follows: AWS_ACCESS_KEY_ID=whatever AWS_SNS_TOPIC_ARN=arn:aws:sns:region:1234567890:topic-name
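For example, a fully environment-driven run might look like the sketch below. The MYSQL_* names are inferred from the flattening pattern rather than documented, so check config.example.yml if they aren't picked up:

$ AWS_ACCESS_KEY_ID=whatever \
  AWS_SECRET_ACCESS_KEY=whatever \
  MYSQL_HOST=localhost \
  MYSQL_DATABASE=compstat \
  bundle exec jruby bin/compstat_scraper.rb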

Cron

You can use cron (https://en.wikipedia.org/wiki/Cron) to run this scraper automatically, on a regular basis.

E.g. to set up a weekly-ish cron job that runs just after midnight on Thursdays and Fridays, add this to your crontab. (Use crontab -e to edit it.) 0 0 * * 4,5 /bin/bash -c 'export PATH="$HOME/.rbenv/bin:$PATH"; eval "$(rbenv init -)"; jruby -S ruby /bin/compstat_scraper.rb'
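If you want a record of each run, you can append the job's output to a log file (the log path here is just illustrative):

0 0 * * 4,5 /bin/bash -c 'export PATH="$HOME/.rbenv/bin:$PATH"; eval "$(rbenv init -)"; jruby -S ruby /bin/compstat_scraper.rb' >> "$HOME/compstat_cron.log" 2>&1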

Docker and boot2docker

cd ~/src
docker build -t compstat .
docker run -it compstat bundle exec jruby bin/compstat_scraper.rb
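To give the container a config.yml, one option is to mount it in at runtime. The /src destination below is an assumption about the image's working directory, so adjust it to match the Dockerfile:

docker run -it -v "$PWD/config.yml:/src/config.yml" compstat bundle exec jruby bin/compstat_scraper.rb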
Export from MySQL to CSV

To export from MySQL to CSV:

mysql compstat -e "select * from crimes_by_precinct" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > crime_stats_from_mysql.csv

taking care that the first pattern matches a real tab: GNU sed accepts \t, but Mac/BSD sed requires a literal tab character in its place.
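If you'd rather not type a literal tab, bash's ANSI-C quoting expands \t to a tab before sed sees it, which works with either sed (the no-op newline substitution is dropped here):

mysql compstat -e "select * from crimes_by_precinct" | sed $'s/\t/","/g;s/^/"/;s/$/"/' > crime_stats_from_mysql.csv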

Want to contribute?

I welcome your contributions. If you have suggestions or issues, please register them on the GitHub issues page. If you'd like to add a feature or fix a bug, please open a GitHub pull request. Or send me an email; I'm happy to guide you through the process.

And, if you're using these, please let me know. I'd love to hear from you!

