Intel-bigdata/hbase-spark-datasource

Name: hbase-spark-datasource

Owner: Intel-bigdata

Description: HBase datasource for SparkSQL and DataFrames

Created: 2016-04-22 02:32:29.0

Updated: 2017-11-02 01:59:12.0

Pushed: 2016-06-08 01:37:17.0

Homepage: null

Size: 455

Language: Scala

README

hbase-spark-datasource

HBase datasource for SparkSQL and DataFrames.

This project is based on Spark 1.6 and HBase 0.98.17-hadoop2.

Quick start

To use this package, copy the compiled jar file to every node of your cluster and add it to the extra classpath of your Spark driver and executors. Add the jars under HBASE_HOME/lib alongside hbase-spark-datasource.jar.

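Usage in spark-shell
./spark-shell --jars hbase-spark-datasource.jar,$(echo $HBASE_HOME/lib/*.jar | tr ' ' ',')

Note that --jars expects a comma-separated list of jar paths, so the HBase jars are joined with commas here.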

Write a basic catalog for your table and pass it to the connector.

import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog

val catalog = s"""{"table":{"namespace":"default", "name":"test2"},
                 |"rowkey":"key",
                 |"columns":{
                 |"KEY_FIELD":{"cf":"rowkey", "col":"key", "type":"string"},
                 |"A_FIELD":{"cf":"cf2", "col":"a", "type":"string"}
                 |}
                 |}""".stripMargin

val df = sqlContext.load("org.apache.hadoop.hbase.spark",
  Map(HBaseTableCatalog.tableCatalog -> catalog, "hbase.config.resources" -> "YOUR_HBASE_HOME/conf/hbase-site.xml"))

table identifies the table you operate on; set its namespace and name here.

rowkey declares the row key. Its entry is always "rowkey":"key".

columns lists every column of the table together with its column family, column name, and type.

The row key must be declared here as a column named KEY_FIELD, followed by the other columns one by one.
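
Writing the catalog JSON by hand gets error-prone as the number of columns grows. The following is a minimal sketch of a helper that assembles the same structure from a list of column mappings. HBaseColumn and CatalogBuilder are hypothetical names introduced for illustration and are not part of hbase-spark-datasource; the sketch also assumes a string row key and that no name needs JSON escaping.

```scala
// Hypothetical helper (not part of hbase-spark-datasource): one entry of the
// "columns" section, mapping a DataFrame field to an HBase column.
case class HBaseColumn(field: String, cf: String, col: String, dataType: String)

object CatalogBuilder {
  // Builds the catalog JSON described above.
  // Assumption: namespace, table, and column names need no JSON escaping.
  def build(namespace: String, table: String, columns: Seq[HBaseColumn]): String = {
    // The row key is always declared first, as KEY_FIELD in the "rowkey" family.
    // Assumption: the row key is a string, as in the README example.
    val all = HBaseColumn("KEY_FIELD", "rowkey", "key", "string") +: columns
    val cols = all.map { c =>
      s""""${c.field}":{"cf":"${c.cf}", "col":"${c.col}", "type":"${c.dataType}"}"""
    }.mkString(",\n")
    s"""{"table":{"namespace":"$namespace", "name":"$table"},
       |"rowkey":"key",
       |"columns":{
       |$cols
       |}
       |}""".stripMargin
  }
}
```

For example, CatalogBuilder.build("default", "test2", Seq(HBaseColumn("A_FIELD", "cf2", "a", "string"))) produces a catalog with the same table, rowkey, and columns entries as the hand-written one above.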

Now you can use sqlContext.load() to load the data and build the corresponding DataFrame.

The first parameter should be "org.apache.hadoop.hbase.spark".

You can pass custom settings in the second parameter, but at a minimum you must assign your catalog to HBaseTableCatalog.tableCatalog and the path of your HBase configuration file to "hbase.config.resources".


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.