hortonworks/spark-native-yarn

Name: spark-native-yarn

Owner: Hortonworks Inc

Description: Tez port for Spark API

Created: 2014-06-24 14:19:16.0

Updated: 2017-12-21 18:27:24.0

Pushed: 2017-03-15 22:19:17.0

Size: 7433

Language: Scala

README

spark-native-yarn
Native YARN integration with Apache Spark

For feedback and suggestions please use this project's Issues feature.

============

IMPORTANT: At the time of writing, the project is a prototype whose goal is to demonstrate the validity of the approach described in SPARK-3561. For an idea of the currently supported functionality, please refer to APIDemoTests as well as the Samples project.

==

The spark-native-yarn project is an extension to Apache Spark that enables DAGs assembled with the Spark API to run on Apache Tez, so that applications can benefit from native Tez features, especially those suited to large-scale batch/ETL workloads.

Aside from enabling Spark DAGs to run on Apache Tez, this project provides additional functionality aimed at developer productivity.

At the time of writing, spark-native-yarn depends on the modifications to Spark code described in SPARK-3561. This means that to use it, you must have a custom build of Spark that incorporates the pending GitHub pull request. You can build your own by following the instructions below, or you can download a pre-built distribution from here.

IMPORTANT: If you opt for a pre-built distribution, keep in mind that it is based on the Spark 1.1 release, which means you have to use the compatible spark-native-yarn branch, 1.1.1.

If you want to take your chances with the latest Spark snapshot, follow the instructions below. Otherwise (for the pre-built distribution), skip them and go straight to building spark-native-yarn, or follow the pre-built spark-shell and/or spark-submit instructions.

Below are the prerequisites and instructions on how to proceed.

IMPORTANT: Please complete the prerequisites described below and then continue to the Getting Started guide.

Checkout and Build SPARK-3561
git clone https://github.com/olegz/spark-1.git
cd spark-1
git fetch --all

Switch to SPARK-3561 branch

git branch --track SH-1 origin/SH-1
git checkout SH-1

Spark uses Maven for its build, so Maven must be installed. To avoid out-of-memory (OOM) errors during the build, set the Maven options as shown below. See Spark's documentation for more details.

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
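
Since the build needs both git and Maven, it can help to confirm they are on the PATH before starting a 20-30 minute build. The following is a minimal sketch, not part of the project; the function name `require_tools` is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: report every required tool that is missing from PATH.
# Prints "missing: <tool>" for each one and returns non-zero if any is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For this build it could be called as `require_tools git mvn || exit 1`, which stops early if either tool is absent.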
Build and install SPARK-3561 into your local maven repository
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean install

The build should take 20-30 minutes depending on your machine. When it finishes, you should see a successful build summary:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [  2.281 s]
[INFO] Spark Project Core ................................ SUCCESS [02:33 min]
[INFO] Spark Project Bagel ............................... SUCCESS [ 18.959 s]
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
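
If the build output is saved to a file, the success banner shown above can be checked automatically. This is only a sketch under stated assumptions: `build_succeeded` is a hypothetical helper, and the log file name is whatever you chose when saving the output.

```shell
# Hypothetical helper: return 0 when the given Maven log contains the
# "BUILD SUCCESS" banner, non-zero otherwise.
# $1: path to a saved build log (the file name is up to the caller).
build_succeeded() {
  grep -q "BUILD SUCCESS" "$1"
}
```

For example, after appending `2>&1 | tee build.log` to the mvn command (`build.log` being an arbitrary name), `build_succeeded build.log && echo "Spark build OK"` prints a confirmation only on success. Maven's own exit status is the more direct signal; the log check is mainly useful when the build ran unattended.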
Clone spark-native-yarn
git clone https://github.com/hortonworks/spark-native-yarn.git
cd spark-native-yarn

To switch to the 1.1.1 branch:

git fetch --all
git branch --track 1.1.1 origin/1.1.1
git checkout 1.1.1

This completes the prerequisites required to run spark-native-yarn, and you can now continue to the Getting Started guide.

==