Name: hsredshift
Owner: HearthSim
Description: Data ingestion and ETL using Amazon Redshift.
Created: 2017-01-11 06:11:36.0
Updated: 2018-05-22 22:17:33.0
Pushed: 2018-05-22 22:17:31.0
Homepage: https://hsreplay.net
Size: 1452
Language: Python
Libraries for ETL and analysis of HSReplay.xml files via Redshift.
This section assumes that your working directory is the repository root.
To run a job locally:
python <JOB_NAME>.py <INPUT_FILE.TXT>
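Each job script is invoked with a line-oriented input file. As a rough illustration of that call shape only — the real jobs in this repository are mrjob MapReduce jobs with more options, and `process` here is a hypothetical stand-in, not this repo's API:

```python
import sys

def process(path):
    """Toy stand-in for a job: count non-empty lines in the input file.

    The actual jobs here are mrjob jobs over HSReplay data; this only
    mirrors the `python <JOB_NAME>.py <INPUT_FILE.TXT>` call shape.
    """
    with open(path) as f:
        return sum(1 for line in f if line.strip())

if __name__ == "__main__":
    # Mirrors: python <JOB_NAME>.py <INPUT_FILE.TXT>
    print(process(sys.argv[1]))
```

A real job would subclass `mrjob.job.MRJob` rather than read the file directly.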
To generate an inputs.txt data set:
PYTHONPATH=$PYTHONPATH:. python ./loaders/emr/generate_inputs.py
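The generated file is named for the time window it covers (as in the concrete example further down). A sketch of that naming scheme; `window_filename` is a hypothetical helper, not part of this repository:

```python
from datetime import datetime

# Timestamp format used in the generated file names, e.g.
# 2017-01-05_00-00-00_TO_2017-01-05_00-10-00_inputs.txt
FMT = "%Y-%m-%d_%H-%M-%S"

def window_filename(start, end):
    """Build the inputs file name for a [start, end) time window."""
    return "%s_TO_%s_inputs.txt" % (start.strftime(FMT), end.strftime(FMT))
```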
By default the file will be written to loaders/emr/data/. This will also stage the inputs.txt file on S3, which makes launching the job faster.
To bootstrap a cluster:
mrjob create-cluster --conf-path ./loaders/emr/mrjob.conf
<CLUSTER-ID>
The command prints the <CLUSTER-ID> of the newly created cluster.
Then to run the job on the cluster:
python my_job.py -r emr --conf-path mrjob.conf --cluster-id <CLUSTER-ID> <INPUT_S3_PATH> --no-output --connection="<upload-db connection string>"
A concrete example might look like:
PYTHONPATH=$PYTHONPATH:. python ./loaders/emr/load_redshift.py -r emr --conf-path ./loaders/emr/mrjob.conf --cluster-id j-2YYNNMRVT35MN s3://hearthsim-mrjob/data/2017-01-05_00-00-00_TO_2017-01-05_00-10-00_inputs.txt --no-output
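The same invocation can be assembled programmatically, for example from a scheduler. A sketch using the example values above; `build_emr_command` is a hypothetical helper, and whether you shell out via subprocess or use mrjob's Python runner API is a separate choice:

```python
def build_emr_command(job, conf, cluster_id, input_path):
    """Assemble the mrjob EMR invocation as an argv list."""
    return [
        "python", job,
        "-r", "emr",
        "--conf-path", conf,
        "--cluster-id", cluster_id,
        input_path,
        "--no-output",
    ]

cmd = build_emr_command(
    "./loaders/emr/load_redshift.py",
    "./loaders/emr/mrjob.conf",
    "j-2YYNNMRVT35MN",
    "s3://hearthsim-mrjob/data/2017-01-05_00-00-00_TO_2017-01-05_00-10-00_inputs.txt",
)
# cmd can then be handed to subprocess.run(cmd, check=True),
# with PYTHONPATH extended as in the shell example above.
```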
Copyright © HearthSim - All Rights Reserved
This is a HearthSim project. All development
happens on our IRC channel #hearthsim
on Freenode.