Name: tpch-spark
Owner: UW eScience Institute
Description: TPC-H queries in spark SQL using native DataFrames API
Forked from: ssavvides/tpch-spark
Created: 2016-04-05 22:31:35.0
Updated: 2016-04-05 22:31:37.0
Pushed: 2016-04-05 22:36:57.0
Size: 399
Language: C
TPC-H queries implemented in Spark using the DataFrames API (introduced in Spark 1.3.0)
Savvas Savvides
ssavvides@us.ibm.com
savvas@purdue.edu
First compile using:
sbt package
Before compiling, set INPUT_DIR and OUTPUT_DIR in the TpchQuery class to point to the location of the input data and to where the output should be saved.
You can then run a query using:
spark-submit --class "main.scala.TpchQuery" --master MASTER target/scala-2.10/spark-tpc-h-queries_2.10-1.0.jar ##
where ## is the number of the query to run (e.g. 1, 2, …, 22) and MASTER specifies the Spark mode (e.g. local, yarn, standalone).
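To run the whole suite rather than a single query, the invocation can be wrapped in a loop over the query numbers 1 to 22. The following is only a sketch: the function name `run_all_queries` and the `DRY_RUN` switch are hypothetical helpers introduced here, not part of this repository, and the defaults for `MASTER` and `JAR` simply assume local mode and the jar path used in the single-query command.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): run TPC-H queries 1..22 in order.
# MASTER and JAR default to assumed values; override them in the environment.
MASTER="${MASTER:-local}"
JAR="${JAR:-target/scala-2.10/spark-tpc-h-queries_2.10-1.0.jar}"

run_all_queries() {
  q=1
  while [ "$q" -le 22 ]; do
    if [ -n "${DRY_RUN:-}" ]; then
      # DRY_RUN is an assumption added for illustration:
      # print the command instead of executing it.
      echo spark-submit --class "main.scala.TpchQuery" --master "$MASTER" "$JAR" "$q"
    else
      # Stop at the first query that fails.
      spark-submit --class "main.scala.TpchQuery" --master "$MASTER" "$JAR" "$q" || return 1
    fi
    q=$((q + 1))
  done
}
```

After sourcing the file, `DRY_RUN=1 run_all_queries` lists the 22 commands without needing a Spark installation, and `MASTER=yarn run_all_queries` would submit them one after another.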
Data generator (http://www.tpc.org/tpch/)
TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)
TPC-H for PIG (https://github.com/ssavvides/tpch-pig)