spotify/dbeam

Name: dbeam

Owner: Spotify

Description: DBeam extracts SQL tables using JDBC and Apache Beam

Created: 2017-11-09 13:21:09

Updated: 2018-05-23 15:22:27

Pushed: 2018-05-23 15:22:25

Homepage:

Size: 90

Language: Scala

README

DBeam

Badges: Build Status, codecov.io, GitHub license, Maven Central

A connector tool to extract data from SQL databases and import into GCS using Apache Beam.

This tool is runnable locally, or on any other backend supported by Apache Beam, e.g. Cloud Dataflow.

DEVELOPMENT STATUS: Alpha. Already usable in production.

Overview

DBeam is a Scio-based, single-threaded pipeline that reads all the data from a single SQL database table, converts it into Avro, and stores it in the appointed output location, usually on GCS. Scio runs on Apache Beam.

DBeam requires the database credentials, the name of the database table to read, and the output location to store the extracted data in. DBeam first runs a single SELECT with LIMIT 1 against the target table to infer the table schema. Once the schema is created, the job is launched; it simply streams the table contents via JDBC into the target location as Avro.
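
To make the schema-inference step concrete, here is a minimal Scala sketch, not DBeam's actual implementation: it runs a LIMIT 1 query over JDBC and maps the result set metadata to an Avro schema, with the column type mapping deliberately reduced to optional longs and strings.

import java.sql.DriverManager
import org.apache.avro.SchemaBuilder

// Hypothetical sketch of schema inference: fetch one row and map the JDBC
// result set metadata to Avro fields (simplified to optional long/string).
object SchemaInferenceSketch {
  def inferSchema(connectionUrl: String, username: String, password: String, table: String) = {
    val connection = DriverManager.getConnection(connectionUrl, username, password)
    try {
      val meta = connection.createStatement()
        .executeQuery(s"SELECT * FROM $table LIMIT 1")
        .getMetaData
      var fields = SchemaBuilder.record(table).fields()
      for (i <- 1 to meta.getColumnCount) {
        fields = meta.getColumnType(i) match {
          case java.sql.Types.BIGINT | java.sql.Types.INTEGER => fields.optionalLong(meta.getColumnName(i))
          case _ => fields.optionalString(meta.getColumnName(i))
        }
      }
      fields.endRecord()
    } finally connection.close()
  }
}

The real job then reuses the inferred schema while streaming the full table contents into Avro files at the output location.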

dbeam Java/Scala package features
dbeam arguments
Building

Build with SBT's package task to get a jar that you can run with java -cp. Note that this does not create a fat jar, so you need to include the dependencies on the class path.

sbt package

You can also build the project with SBT's pack task, which creates a dbeam-pack/target/pack directory containing all the dependencies, along with a shell script to run DBeam.

sbt pack

Now you can run the script directly from the created dbeam-pack directory:

dbeam-pack/target/pack/bin/jdbc-avro-job

TODO: We will be improving the packaging and releasing process shortly.

Examples
java -cp CLASS_PATH:dbeam-core_2.12.jar com.spotify.dbeam.JdbcAvroJob \
  --output=gs://my-testing-bucket-name/ \
  --username=my_database_username \
  --password=secret \
  --connectionUrl=jdbc:postgresql://some.database.uri.example.org:5432/my_database \
  --table=my_table

For CloudSQL:

java -cp CLASS_PATH:dbeam-core_2.12.jar com.spotify.dbeam.JdbcAvroJob \
  --output=gs://my-testing-bucket-name/ \
  --username=my_database_username \
  --password=secret \
  --connectionUrl='jdbc:postgresql://google/database?socketFactory=com.google.cloud.sql.postgres.SocketFactory&socketFactoryArg=project:region:cloudsql-instance' \
  --table=my_table

To validate a data extraction, one can run:

java -cp CLASS_PATH:dbeam-core_2.12.jar com.spotify.dbeam.JdbcAvroJob \
  --output=gs://my-testing-bucket-name/ \
  --username=my_database_username \
  --password=secret \
  --connectionUrl=jdbc:postgresql://some.database.uri.example.org:5432/my_database \
  --table=my_table \
  --limit=10 \
  --skipPartitionCheck

Requirements

DBeam is built on top of Scio and supports both Scala 2.12 and 2.11.

To include the DBeam library in an SBT project, add the following to build.sbt:

libraryDependencies ++= Seq(
  "com.spotify" %% "dbeam-core" % dbeamVersion
)
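
With the dependency in place, the same export can also be launched from Scala code instead of the command line. The snippet below is only a sketch: it assumes that com.spotify.dbeam.JdbcAvroJob exposes the standard main entry point invoked in the examples above and accepts the same --option=value arguments; MyDbeamExtraction and all argument values are placeholders.

import com.spotify.dbeam.JdbcAvroJob

// Hypothetical wrapper: forwards CLI-style arguments to DBeam's main class.
object MyDbeamExtraction {
  def main(cmdArgs: Array[String]): Unit = {
    JdbcAvroJob.main(Array(
      "--output=gs://my-testing-bucket-name/",
      "--username=my_database_username",
      "--password=secret",
      "--connectionUrl=jdbc:postgresql://some.database.uri.example.org:5432/my_database",
      "--table=my_table"
    ))
  }
}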

Development

Make sure you have sbt installed. As an editor, IntelliJ IDEA with the Scala plugin is recommended.

To test and verify during development, run:

sbt clean scalastyle test:scalastyle coverage test coverageReport coverageAggregate

This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.


License

Copyright 2016-2017 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0


