citusdata/podyn

Name: podyn

Owner: Citus Data

Description: DynamoDB to PostgreSQL & Citus continuous replication tool

Created: 2017-07-21 16:32:06.0

Updated: 2018-05-23 10:22:43.0

Pushed: 2017-12-05 14:37:16.0

Homepage:

Size: 70

Language: Java

GitHub Committers

UserMost Recent Commit# Commits

Other Committers

UserEmailMost Recent Commit# Commits

README

Podyn: DynamoDB to PostgreSQL replication

The podyn tool replicates DynamoDB tables to PostgreSQL tables, which can optionally be distributed using Citus. It can also keep the tables in sync by continuously streaming changes.

Read the Blog post on Podyn.

Building from source

To build a shaded JAR, run:

clone https://github.com/citusdata/podyn.git
odyn
package

The JAR file will be at target/podyn-1.0.jar.

Running the JAR file

Once you've built the JAR, you can run it as follows.

dyn --help
e: podyn
--changes                    Continuously replicate changes
--data                       Replicate the current data
--help                       Show help
,--lower-case-column-names   Use lower case column names
--conversion-mode <arg>      Conversion mode, either columns or jsonb (default: columns)
--num-connections <arg>      Database connection pool size (default 16)
--scan-rate <arg>            Maximum reads/sec during scan (default 25)
--schema                     Replicate the table schema
--table <arg>                DynamoDB table name(s) to replicate
--postgres-jdbc-url <arg>    PostgreSQL JDBC URL of the destination
--citus                      Create distributed tables using Citus

When --postgres-jdbc-url is omitted, the SQL statements that would otherwise be sent to the database are sent to stdout. When --table is omitted, all DynamoDB tables in the region are replicated.

Replicate schema and data from DynamoDB

After setting up your AWS credentials, you can replicate the schema and do an initial data load by running:

rt AWS_REGION=us-east-1

dyn --postgres-jdbc-url "jdbc:postgresql://host:5432/citus?sslmode=require&user=citus&password=pw" --schema --data --citus

tructing table schema for table clicks
ng data for table clicks
ng new column to table clicks: ip text
ng new column to table clicks: object text

When --schema is specified, tables will be created in PostgreSQL as described in the Schema conversion rules section. If --citus is specified the tables will be distributed by the DynamoDB partition key. When the --data argument is specified, all the data in the DynamoDB table is scanned in batches and COPY is used to load the batch into postgres.

Stream changes from DynamoDB

After schema creation and the initial data load, you can continuously stream changes using:

dyn --postgres-jdbc-url "jdbc:postgresql://host:5432/citus?sslmode=require&user=citus&password=pw" --changes --citus

icating changes for table clicks

The changes are processed in batches and new fields are added to the table as columns. The changes are translated into delete or upsert statements that are sent to postgres over multiple connections (specified using -n) to achieve high throughput.

When running the command immediately after a data load, some changes that were made prior to the data load may be re-applied, causing the replicated database to temporarily regress. However, since the changes are applied in the same order they will eventually arrive at the current value. After loading a batch of changes into the database, a checkpoint is made. If the tool is restarted, it will continue from its last checkpoint. The checkpoints are stored in DynamoDB tables prefixed with podyn_migration_.

Schema conversion rules

Podyn currently has two ways of converting a DynamoDB schema to a PostgreSQL schema: columns and jsonb. The default is columns. To use the jsonb conversion mode add -m jsonb as an argument.

Columns mode

In the default schema conversion mode (-m columns), top-level keys in DynamoDB items are translated into columns. The initial schema is derived from the partition key, sort key and secondary indexes. When using Citus, the primary partition key becomes the distribution column.

DynamoDB types are mapped to PostgreSQL types according to the following table:

| DynamoDB type | PostgreSQL type | | ————- | ————— | | String | text | | Binary | bytea | | Numeric | numeric | | StringSet | jsonb | | NumberSet | jsonb | | BinarySet | jsonb | | Map | jsonb | | List | jsonb | | Boolean | boolean | | Null | text |

For example, a DynamoDB table named clicks has primary partition key: sitename (String), primary sort key: time (String), and a secondary index named pageid-index on: pageid (Numeric). The following statements will be sent to PostgreSQL:

TE TABLE clicks (
itename text NOT NULL,
time" text NOT NULL,
ageid numeric NOT NULL,
RIMARY KEY (sitename, "time")

TE INDEX "pageid-index" ON clicks (pageid);
CT create_distributed_table('clicks', 'sitename');

A new column is added to the PostgreSQL table whenever a new key appears in an item during the initial data load or while replicating changes. For example, if "ip" -> "198.51.100.3" key is encountered in an item and the table does not have an ip column, then the following command is sent to the database:

R TABLE clicks ADD COLUMN ip text;

After replicating a single item, the table might look as follows.

CT * FROM clicks;
ECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------
name | citusdata.com
     | 2017-09-18 16:08:36.989788+02
id   | 347712
     | 198.51.100.3
ct   | home_button

In DynamoDB, the same key can appear with different types as long as it's not part of the primary key or a secondary index. If Podyn encounters a key-value pair with a type that's not compatible with the column type, then a new column is added with the type name added as a suffix. For example, if an item appears which contains "ip" -> false, then Podyn would send the following command to PostgreSQL:

R TABLE clicks ADD COLUMN ip_boolean boolean;

If you have keys that have values of different types, or many different top-level keys, then you may want to consider using JSONB mode.

JSONB mode

In the JSONB conversion mode (-m jsonb), the initial schema is never changed. Instead, a single JSONB column is added that contains the entire DynamoDB item in JSON format.

Constructing the schema in jsonb mode is done in the same way as in columns mode (partition key and secondary indexes become columns) except with an extra column named data with the jsonb type. In the clicks example, Podyn would send the following statements to PostgreSQL:

TE TABLE clicks (
itename text NOT NULL,
time" text NOT NULL,
ageid numeric NOT NULL,
ata jsonb,
RIMARY KEY (sitename, "time")

TE INDEX "pageid-index" ON clicks (pageid);
CT create_distributed_table('clicks', 'sitename');

During replication, the entire DynamoDB item is converted to JSON, including the fields that already appear as columns, for example:

CT * FROM clicks;
ECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------
name | citusdata.com
     | 2017-09-18 16:08:36.989788+02
id   | 347712
     | {"ip": "198.51.100.3", "time": "2017-09-18 16:08:36.989788+02", "object": "home_button", "pageid": 347712, "sitename": "citusdata.com"}

The JSONB mode has an additional advantage that you can safely set up multiple Podyn instances for replicating changes. They will automatically divide the work and if a node fails, the other node(s) will automatically take over.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.