spotify/spark-bigquery

Name: spark-bigquery

Owner: Spotify

Description: Google BigQuery support for Spark, SQL, and DataFrames

Created: 2016-04-22 20:17:31.0

Updated: 2018-05-21 00:34:46.0

Pushed: 2018-04-17 19:52:52.0

Homepage: null

Size: 59

Language: Scala

README

MAINTENANCE MODE

THIS PROJECT IS IN MAINTENANCE MODE DUE TO THE FACT THAT IT'S NOT WIDELY USED WITHIN SPOTIFY. WE'LL PROVIDE BEST EFFORT SUPPORT FOR ISSUES AND PULL REQUESTS BUT DO EXPECT DELAY IN RESPONSES.

spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames.

| spark-bigquery version | Spark version | Comment            |
| :--------------------: | ------------- | ------------------ |
| 0.2.x                  | 2.x.y         | Active development |
| 0.1.x                  | 1.x.y         | Development halted |

To use the package in a Google Cloud Dataproc cluster:

spark-shell --packages com.spotify:spark-bigquery_2.10:0.2.0

To use it in a local SBT console:

import com.spotify.spark.bigquery._

// Set up GCP credentials
sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>")

// Set up BigQuery project and bucket
sqlContext.setBigQueryProjectId("<BILLING_PROJECT>")
sqlContext.setBigQueryGcsBucket("<GCS_BUCKET>")

// Set up BigQuery dataset location, default is US
sqlContext.setBigQueryDatasetLocation("<DATASET_LOCATION>")
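
For the local SBT console mentioned above, the dependency could be declared in build.sbt roughly as follows. This is only a sketch: the coordinates are copied from the --packages string used for Dataproc, and it assumes the project's Scala binary version matches the _2.10 suffix and that Spark itself is already on the classpath.

```scala
// build.sbt (sketch, not taken from the project's own build files)
// Coordinates mirror the --packages string: com.spotify:spark-bigquery_2.10:0.2.0
// Assumption: the project is built against Scala 2.10.x so the suffix matches.
libraryDependencies += "com.spotify" % "spark-bigquery_2.10" % "0.2.0"
```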

Usage:

// Load everything from a table
val table = sqlContext.bigQueryTable("bigquery-public-data:samples.shakespeare")

// Load results from a SQL query
// Only legacy SQL dialect is supported for now
val df = sqlContext.bigQuerySelect(
  "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")

// Save data to a table
df.saveAsBigQueryTable("my-project:my_dataset.my_table")
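
The table references above follow a project:dataset.table pattern, and the legacy SQL query wraps that reference in square brackets. The sketch below models the format in plain Scala so that references can be checked before they reach BigQuery. TableRef, parse, and legacySelect are hypothetical names used for illustration; they are not part of spark-bigquery.

```scala
// Hypothetical helper for illustration; not part of spark-bigquery.
// Models a "project:dataset.table" reference as used in the examples above.
final case class TableRef(project: String, dataset: String, table: String) {
  // The reference string in "project:dataset.table" form.
  def spec: String = s"$project:$dataset.$table"

  // Legacy SQL wraps the full reference in square brackets.
  def legacySqlName: String = s"[$spec]"
}

object TableRef {
  // project: text before the first ':'
  // dataset: text up to the next '.'
  // table:   the remainder
  private val Pattern = """([^:]+):([^.]+)\.(.+)""".r

  // Returns None when the string does not match the expected shape.
  def parse(ref: String): Option[TableRef] = ref match {
    case Pattern(project, dataset, table) => Some(TableRef(project, dataset, table))
    case _                                => None
  }

  // Builds a legacy SQL SELECT over the given columns.
  def legacySelect(columns: Seq[String], ref: TableRef): String =
    s"SELECT ${columns.mkString(", ")} FROM ${ref.legacySqlName}"
}
```

With this sketch, parsing "bigquery-public-data:samples.shakespeare" and passing the result to legacySelect with the columns word and word_count reproduces the query string used in the usage example.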

If you'd like to write nested records to BigQuery, be sure to specify an Avro namespace. BigQuery cannot load nested records whose Avro namespace has a leading dot (for example, .nestedColumn).

// BigQuery is able to load fields with namespace 'myNamespace.nestedColumn'
df.saveAsBigQueryTable("my-project:my_dataset.my_table", tmpWriteOptions = Map("recordNamespace" -> "myNamespace"))
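
One way to picture the leading-dot problem is the small model below. It assumes that a nested record's namespace is formed by joining the record namespace and the field name with a dot. That assumption is consistent with the two namespaces quoted above, but it is not stated by this README and is not code from spark-bigquery or its Avro writer. AvroNamespaceSketch and its methods are hypothetical names.

```scala
// Illustrative model only; not code from spark-bigquery or any Avro library.
object AvroNamespaceSketch {
  // Assumption: nested namespace = "<recordNamespace>.<fieldName>".
  def nestedNamespace(recordNamespace: String, fieldName: String): String =
    s"$recordNamespace.$fieldName"

  // Per the note above, a namespace with a leading dot cannot be loaded.
  def loadableByBigQuery(namespace: String): Boolean =
    !namespace.startsWith(".")
}
```

Under this model an empty record namespace yields ".nestedColumn", which fails the check, while "myNamespace" yields "myNamespace.nestedColumn", which passes.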

See also Loading Avro Data from Google Cloud Storage for data type mappings and limitations. For example, loading arrays of arrays is not supported.

License

Copyright 2016 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.