Intel-bigdata/OAP

Name: OAP

Owner: Intel-bigdata

Description: Optimized Analytics Package for Spark Platform

Created: 2016-03-16 06:56:47.0

Updated: 2018-05-24 13:24:04.0

Pushed: 2018-05-24 04:59:36.0

Homepage:

Size: 136950

Language: Scala


README

OAP - Optimized Analytics Package for Spark Platform

OAP - Optimized Analytics Package (previously known as Spinach) is designed to accelerate ad-hoc queries. OAP defines a new Parquet-like columnar storage data format and offers a fine-grained hierarchical cache mechanism in units of "Fiber" in memory. In addition, OAP extends the Spark SQL DDL to allow users to define customized indices on a relation.

Building
mvn -DskipTests package
Prerequisites

You should have Apache Spark 2.1.0 installed in your cluster. Refer to the Apache Spark documentation for details.

Use OAP with Spark
  1. Build OAP with mvn -DskipTests package and find oap-<version>.jar in target/
  2. Deploy oap-<version>.jar to the master machine.
  3. Add the configurations below to $SPARK_HOME/conf/spark-defaults.conf (a programmatic sketch of the same settings follows the notes below):
    spark.files                         file:///path/to/oap-dir/oap-<version>.jar
    spark.executor.extraClassPath       ./oap-<version>.jar
    spark.driver.extraClassPath         /path/to/oap-dir/oap-<version>.jar
    spark.memory.offHeap.enabled        true
    spark.memory.offHeap.size           20g

  4. Run Spark via bin/spark-sql, bin/spark-shell, sbin/start-thriftserver or bin/pyspark and try our examples.

NOTE: 1. For Spark standalone mode, you have to put oap-<version>.jar on both the driver and the executors, since spark.files does not work in that mode. Also don't forget to update extraClassPath accordingly.
      2. For YARN mode, configure spark.driver.memory, spark.memory.offHeap.size and spark.yarn.executor.memoryOverhead (which should be close to offHeap.size) to enable the fiber cache.
      3. Comprehensive guidance and examples for OAP configuration can be found at https://github.com/Intel-bigdata/OAP/wiki/OAP-User-guide. Briefly, the recommended configuration is one executor per node with full memory/computation capability.
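
The same settings from step 3 can also be supplied programmatically when building a SparkSession in a driver application. This is a minimal sketch, not taken from the OAP documentation; the jar paths and the 20g size are placeholders copied from the table above and should be adapted to your deployment.

    // Sketch only: mirrors the spark-defaults.conf entries from step 3.
    // Replace <version> and the paths with your actual OAP jar location.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("oap-example")
      .config("spark.files", "file:///path/to/oap-dir/oap-<version>.jar")
      .config("spark.executor.extraClassPath", "./oap-<version>.jar")
      .config("spark.driver.extraClassPath", "/path/to/oap-dir/oap-<version>.jar")
      .config("spark.memory.offHeap.enabled", "true") // off-heap memory backs the Fiber cache
      .config("spark.memory.offHeap.size", "20g")
      .getOrCreate()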
Example
./bin/spark-shell
spark.sql(s"""CREATE TEMPORARY TABLE oap_test (a INT, b STRING)
       | USING oap
       | OPTIONS (path 'hdfs:///oap-data-dir/')""".stripMargin)
val data = (1 to 300).map { i => (i, s"this is test $i") }.toDF().createOrReplaceTempView("t")
spark.sql("insert overwrite table oap_test select * from t")
spark.sql("create oindex index1 on oap_test (a)")
spark.sql("show oindex from oap_test").show()
spark.sql("SELECT * FROM oap_test WHERE a = 1").show()
spark.sql("drop oindex index1 on oap_test")

For more detailed examples with performance comparisons, you can refer to this page for further instructions.

Running Test

To run all the tests, use

mvn test

To run any specific test suite, for example OapDDLSuite, use

mvn -DwildcardSuites=org.apache.spark.sql.execution.datasources.oap.OapDDLSuite test

To run test suites using LocalClusterMode, please refer to SharedOapLocalClusterContext

Features
Configurations and Performance Tuning

Parquet Support - Enable OAP support for parquet files

Fiber Cache Size - Total memory available for caching Fibers, configured implicitly via 'spark.memory.offHeap.size'

Full Scan Threshold - If the analysis result is above this threshold, OAP scans the whole data file instead of reading the index data.

Row Group Size - Row count for each row group

Compression Codec - Choose compression type for OAP data files.

Refer to OAP User guide for more details.
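
As a quick sanity check on Fiber cache sizing (this uses the generic Spark configuration API, nothing OAP-specific), you can read the off-heap settings back from a running session:

    // Reads back the off-heap settings that size the Fiber cache.
    spark.conf.get("spark.memory.offHeap.enabled")   // expected "true" if configured as above
    spark.conf.get("spark.memory.offHeap.size")      // e.g. "20g"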

Query Example and Performance Data

Take two simple ad-hoc queries as examples; the store_sales table comes from TPC-DS at a 200 GB data scale. In general we see about a 5x boost in performance.

  1. “SELECT * FROM store_sales WHERE ss_ticket_number BETWEEN 100 AND 200”

Q6                    | T1/ms | T2/ms | T3/ms | Median/ms
--------------------- | ----- | ----- | ----- | ---------
oap-with-index        | 542   | 295   | 370   | 370
parquet-with-index    | 1161  | 682   | 680   | 682
parquet-without-index | 2010  | 1922  | 1915  | 1922

  2. “SELECT * FROM store_sales WHERE ss_ticket_number < 10000 AND ss_net_paid BETWEEN 100.0 AND 110.0”

Q12                   | T1/ms | T2/ms | T3/ms | Median/ms
--------------------- | ----- | ----- | ----- | ---------
oap-with-index        | 509   | 431   | 437   | 437
parquet-with-index    | 944   | 930   | 1318  | 944
parquet-without-index | 2084  | 1895  | 2007  | 2007
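
For reference, below is a rough sketch of how such a comparison could be reproduced on your own data. It is not the original benchmark script; it assumes a store_sales table has already been written with USING oap (as in the Example section) and the helper name timeQueryMs and the index name ticket_idx are hypothetical. The queries here are filtered, but note that collect() materializes the result on the driver, so use it only when the filtered output is small.

    // Hypothetical helper: wall-clock a query in milliseconds by forcing full evaluation.
    def timeQueryMs(sql: String): Long = {
      val start = System.nanoTime()
      spark.sql(sql).collect()
      (System.nanoTime() - start) / 1000000
    }

    // Build an index on the filter column, then time the two queries from above.
    spark.sql("create oindex ticket_idx on store_sales (ss_ticket_number)")
    val q6  = timeQueryMs("SELECT * FROM store_sales WHERE ss_ticket_number BETWEEN 100 AND 200")
    val q12 = timeQueryMs("SELECT * FROM store_sales WHERE ss_ticket_number < 10000 AND ss_net_paid BETWEEN 100.0 AND 110.0")
    spark.sql("drop oindex ticket_idx on store_sales")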

How to Contribute

If you are looking for some ideas on what to contribute, check out GitHub issues for this project labeled “Pick me up!”. Comment on the issue with your questions and ideas.

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:

