uwescience/tpch-spark

Name: tpch-spark

Owner: UW eScience Institute

Description: TPC-H queries in spark SQL using native DataFrames API

Forked from: ssavvides/tpch-spark

Created: 2016-04-05 22:31:35.0

Updated: 2016-04-05 22:31:37.0

Pushed: 2016-04-05 22:36:57.0

Homepage:

Size: 399

Language: C


README

tpch-spark

TPC-H queries implemented in Spark using the DataFrames API (introduced in Spark 1.3.0)

Savvas Savvides

ssavvides@us.ibm.com

savvas@purdue.edu

Running

First compile using:

sbt package

Before compiling, set INPUT_DIR and OUTPUT_DIR in the TpchQuery class to point to the location of the input data and to where the output should be saved.
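Since the paths are compiled into the jar, it can be worth checking the input directory before building. The following is a minimal sketch, not part of this repository. It assumes the data was generated with the TPC-H dbgen tool, which writes one `<table>.tbl` file per table, and that the queries read those files directly from the input directory (an assumption this README does not confirm). The function name `missing_tpch_tables` is hypothetical.

```shell
#!/bin/sh
# Print the name of every standard TPC-H table file missing from a directory.
# Assumption: the data was generated with dbgen, which writes one
# <table>.tbl file for each of the eight TPC-H tables.
missing_tpch_tables() {
    dir=$1
    for table in customer lineitem nation orders part partsupp region supplier; do
        # Report the file only when it does not exist as a regular file.
        [ -f "$dir/$table.tbl" ] || echo "$table.tbl"
    done
}
```

Calling `missing_tpch_tables /path/to/input` prints nothing when all eight files are present, so an empty result suggests the directory is ready to be used as INPUT_DIR.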

You can then run a query using:

spark-submit --class "main.scala.TpchQuery" --master MASTER target/scala-2.10/spark-tpc-h-queries_2.10-1.0.jar ##

where ## is the number of the query to run (e.g. 1, 2, …, 22) and MASTER specifies the Spark master to run against, e.g. local, yarn, or a standalone cluster.
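To run several queries back to back, the command above can be wrapped in a loop. This is a minimal sketch rather than a script shipped with the project. The function name `run_tpch_queries` and the environment variables `SPARK_SUBMIT`, `MASTER`, and `JAR` are hypothetical names introduced here. The defaults mirror the class name and jar path from the command above, with `local` assumed as the default master.

```shell
#!/bin/sh
# Run TPC-H queries FIRST..LAST (default 1..22) one after another.
# SPARK_SUBMIT, MASTER and JAR are hypothetical overrides; their defaults
# mirror the submit command shown in this README.
run_tpch_queries() {
    first=${1:-1}
    last=${2:-22}
    q=$first
    while [ "$q" -le "$last" ]; do
        # Stop at the first query that fails so errors are not buried.
        "${SPARK_SUBMIT:-spark-submit}" \
            --class "main.scala.TpchQuery" \
            --master "${MASTER:-local}" \
            "${JAR:-target/scala-2.10/spark-tpc-h-queries_2.10-1.0.jar}" \
            "$q" || return 1
        q=$((q + 1))
    done
}
```

For example, `MASTER=yarn run_tpch_queries 1 5` would submit queries 1 through 5 to YARN, and `SPARK_SUBMIT=echo run_tpch_queries` prints the 22 argument lists without launching anything, which is a cheap way to check the settings first.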

Other Implementations
  1. Data generator (http://www.tpc.org/tpch/)

  2. TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)

  3. TPC-H for PIG (https://github.com/ssavvides/tpch-pig)


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.