Name: OAP
Owner: Intel-bigdata
Description: Optimized Analytics Package for Spark Platform
Created: 2016-03-16 06:56:47.0
Updated: 2018-05-24 13:24:04.0
Pushed: 2018-05-24 04:59:36.0
Size: 136950
Language: Scala
OAP - Optimized Analytics Package (previously known as Spinach) is designed to accelerate ad-hoc queries. OAP defines a new Parquet-like columnar storage data format and offers a fine-grained hierarchical cache mechanism, in units called "Fiber", in memory. In addition, OAP extends the Spark SQL DDL to allow users to define custom indices on a relation.
You should have Apache Spark version 2.1.0 installed in your cluster. Refer to Apache Spark's documentation for details.
Build with
mvn -DskipTests package
and find oap-<version>.jar in target/.
Deploy oap-<version>.jar to the master machine, then add the following settings to your Spark configuration (e.g. $SPARK_HOME/conf/spark-defaults.conf):
spark.files file:///path/to/oap-dir/oap-<version>.jar
spark.executor.extraClassPath ./oap-<version>.jar
spark.driver.extraClassPath /path/to/oap-dir/oap-<version>.jar
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 20g
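The same settings can also be passed on the command line when launching Spark; a minimal sketch, assuming the jar lives at /path/to/oap-dir (the paths and the 20g cache size are placeholders to adjust for your deployment):

```shell
# Launch spark-shell with the OAP jar on driver and executor classpaths
# and off-heap memory enabled for the fiber cache.
$SPARK_HOME/bin/spark-shell \
  --conf spark.files=file:///path/to/oap-dir/oap-<version>.jar \
  --conf spark.executor.extraClassPath=./oap-<version>.jar \
  --conf spark.driver.extraClassPath=/path/to/oap-dir/oap-<version>.jar \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g
```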
Run Spark via bin/spark-sql, bin/spark-shell, sbin/start-thriftserver or bin/pyspark and try our examples.
NOTE: 1. For Spark standalone mode, you have to copy oap-<version>.jar to both the driver and the executors yourself, since spark.files does not work in that mode. Also don't forget to update extraClassPath accordingly.
2. For yarn mode, configure spark.driver.memory, spark.memory.offHeap.size and spark.yarn.executor.memoryOverhead (the overhead should be close to offHeap.size) to enable the fiber cache.
3. Comprehensive guidance on OAP configuration, with examples, is available at https://github.com/Intel-bigdata/OAP/wiki/OAP-User-guide. Briefly, the recommended configuration is one executor per node, given the node's full memory and computation capacity.
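To illustrate note 2, a yarn-mode launch might look like the following sketch; the memory sizes are placeholder values, chosen so that memoryOverhead stays close to the off-heap cache size as recommended:

```shell
# Hypothetical yarn-mode launch: spark.yarn.executor.memoryOverhead (in MB)
# is set close to spark.memory.offHeap.size so the fiber cache fits
# inside the container's memory budget.
$SPARK_HOME/bin/spark-sql \
  --master yarn \
  --conf spark.driver.memory=8g \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.yarn.executor.memoryOverhead=22528
```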
bin/spark-shell
spark.sql(s"""CREATE TEMPORARY TABLE oap_test (a INT, b STRING)
           | USING oap
           | OPTIONS (path 'hdfs:///oap-data-dir/')""".stripMargin)
val data = (1 to 300).map { i => (i, s"this is test $i") }.toDF().createOrReplaceTempView("t")
spark.sql("insert overwrite table oap_test select * from t")
spark.sql("create oindex index1 on oap_test (a)")
spark.sql("show oindex from oap_test").show()
spark.sql("SELECT * FROM oap_test WHERE a = 1").show()
spark.sql("drop oindex index1 on oap_test")
For more detailed examples with a performance comparison, you can refer to this page for further instructions.
To run all the tests, use
mvn test
To run a specific test suite, for example OapDDLSuite, use
mvn -DwildcardSuites=org.apache.spark.sql.execution.datasources.oap.OapDDLSuite test
To run test suites using LocalClusterMode, please refer to SharedOapLocalClusterContext.
Parquet Support - OAP support for Parquet files; to disable it:
sqlContext.conf.setConfString(SQLConf.OAP_PARQUET_ENABLED.key, "false")
Fiber Cache Size - Total memory for caching Fibers, configured implicitly by 'spark.memory.offHeap.size': the cache gets
spark.memory.offHeap.size * 0.7
(e.g. about 14g for a 20g off-heap size).
Full Scan Threshold - If the index analysis result is above this threshold, the query scans the whole data file instead of reading the index data.
sqlContext.conf.setConfString(SQLConf.OAP_FULL_SCAN_THRESHOLD.key, "0.8")
Row Group Size - Row count for each row group
sqlContext.conf.setConfString(SQLConf.OAP_ROW_GROUP_SIZE.key, "1048576")
CREATE TABLE t USING oap OPTIONS ('rowgroup' '1048576')
Compression Codec - Choose compression type for OAP data files.
sqlContext.conf.setConfString(SQLConf.OAP_COMPRESSION.key, "SNAPPY")
CREATE TABLE t USING oap OPTIONS ('compression' 'SNAPPY')
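Per-table options such as the two above can also be combined in a single OPTIONS clause. A sketch via the spark-sql CLI, where the table name t and the option values are illustrative:

```shell
# Hypothetical: create an OAP table with both row-group size and
# compression codec set in one OPTIONS clause.
$SPARK_HOME/bin/spark-sql -e \
  "CREATE TABLE t USING oap OPTIONS ('rowgroup' '1048576', 'compression' 'SNAPPY')"
```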
Refer to OAP User guide for more details.
Take two simple ad-hoc queries as examples; the store_sales table comes from TPC-DS at 200G data scale. Generally we see a 5x performance boost.
Q6: | T1/ms | T2/ms | T3/ms | Median/ms
--- | --- | --- | --- | ---
oap-with-index | 542 | 295 | 370 | 370
parquet-with-index | 1161 | 682 | 680 | 682
parquet-without-index | 2010 | 1922 | 1915 | 1922
Q12: | T1/ms | T2/ms | T3/ms | Median/ms
--- | --- | --- | --- | ---
oap-with-index | 509 | 431 | 437 | 437
parquet-with-index | 944 | 930 | 1318 | 944
parquet-without-index | 2084 | 1895 | 2007 | 2007
If you are looking for some ideas on what to contribute, check out GitHub issues for this project labeled “Pick me up!”. Comment on the issue with your questions and ideas.
We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:
Add license headers to new files with the mvn license:format command.
Use spaces around operators and after colons: write a + b, not a+b, and def foo(a: Int, b: Int): Int, not def foo(a:Int,b:Int):Int.