HearthSim/hsredshift

Name: hsredshift

Owner: HearthSim

Description: Data ingestion and ETL using Amazon Redshift.

Created: 2017-01-11 06:11:36

Updated: 2018-05-22 22:17:33

Pushed: 2018-05-22 22:17:31

Homepage: https://hsreplay.net

Size: 1452

Language: Python

README

Redshift Data Warehouse Libraries for HSReplay.net

Libraries for ETL and analysis of HSReplay.xml files via Redshift.

Running MRJobs

This section assumes that your working directory is the repository root.

To run a job locally:

python <JOB_NAME>.py <INPUT_FILE.TXT>
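
Job scripts here are built on mrjob. As a rough, hypothetical sketch of what a <JOB_NAME>.py looks like (the class name, keys and counts below are illustrative placeholders, not this repository's actual code):

from mrjob.job import MRJob

class LoadReplays(MRJob):
    # Hypothetical job: each line of <INPUT_FILE.TXT> is fed to the mapper,
    # which would normally parse it and emit key/value pairs for loading.
    def mapper(self, _, line):
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == "__main__":
    LoadReplays.run()

Invoked as above with no -r flag, mrjob uses its local inline runner; the -r emr flag shown further down sends the same script to an EMR cluster.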

To generate an inputs.txt data set:

PYTHONPATH=$PYTHONPATH:. python ./loaders/emr/generate_inputs.py
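
As an illustrative sketch only (not this repository's implementation): conceptually the generator lists the raw replay objects for a time window, writes one S3 path per line to inputs.txt, and stages that file on S3. The replay bucket and prefix below are made-up placeholders; hearthsim-mrjob and the data/ key layout are taken from the concrete example further down. A boto3 version of the idea:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

local_path = "loaders/emr/data/inputs.txt"
with open(local_path, "w") as out:
    # One line per raw replay object for the MRJob to process.
    for page in paginator.paginate(Bucket="replay-bucket", Prefix="uploads/2017/01/05/"):
        for obj in page.get("Contents", []):
            out.write("s3://replay-bucket/%s\n" % obj["Key"])

# Staging the list on S3 makes launching the EMR job faster.
s3.upload_file(local_path, "hearthsim-mrjob",
               "data/2017-01-05_00-00-00_TO_2017-01-05_00-10-00_inputs.txt")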

By default the file will go to loaders/emr/data/. This will also stage the inputs.txt file on S3 to make launching the job faster.

To bootstrap a cluster:

mrjob create-cluster --conf-path ./loaders/emr/mrjob.conf
<CLUSTER-ID>

The command prints the ID of the newly created cluster (a value like j-2YYNNMRVT35MN); pass it to --cluster-id in the next step.

Then to run the job on the cluster:

python my_job.py -r emr --conf-path mrjob.conf --cluster-id <CLUSTER-ID> <INPUT_S3_PATH> --no-output --connection="<upload-db connection string>"
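
The format of the upload-db connection string is determined by the loader and is not spelled out here; assuming a standard PostgreSQL-style DSN (Redshift speaks the PostgreSQL protocol, port 5439 by default), a hypothetical value can be sanity-checked with psycopg2 before launching the job:

import psycopg2

# Hypothetical DSN; substitute the real upload-db host, credentials and database.
dsn = "postgresql://etl_user:secret@example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/uploads"

conn = psycopg2.connect(dsn)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()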

A concrete example might look like:

PYTHONPATH=$PYTHONPATH:. python ./loaders/emr/load_redshift.py -r emr --conf-path ./loaders/emr/mrjob.conf --cluster-id j-2YYNNMRVT35MN s3://hearthsim-mrjob/data/2017-01-05_00-00-00_TO_2017-01-05_00-10-00_inputs.txt --no-output

License

Copyright © HearthSim - All Rights Reserved

Community

This is a HearthSim project. All development happens on our IRC channel #hearthsim on Freenode.

