Name: gizer
Owner: racker
Description: ETL engine for MongoDB, prerequisite for racker/caspian-data-access
Created: 2016-04-05 19:29:42.0
Updated: 2017-02-20 20:53:08.0
Pushed: 2017-06-28 18:15:47.0
Homepage:
Size: 605
Language: Python
GitHub Committers
User | Most Recent Commit | # Commits |
Other Committers
User | Email | Most Recent Commit | # Commits |
README
Intro
The application requires two connections to PostgreSQL instances: one is
used for caching purposes and the other is the real target instance. At
least PostgreSQL 9.3 is required. This requirement comes from the initial
database synchronization (init load).
The solution is divided into three phases:
- Schema acquisition.
  See the get_mongo_schema_as_json.py tool from
  https://github.com/YaroslavLitvinov/mongo_schema. That tool
  creates the schema of the data to be used during the init and oplog
  load stages.
- Init load.
  This is performed by bulk psql COPY. During this phase data
  is read directly from the mongo connection, so no oplog connection is
  used at this stage. It can be useful to run only this stage if
  you just need to update your Postgres data once a day.
  mongo_reader.py and psql_copy.py are the tools for this stage:
  mongo_reader.py reads data from MongoDB into CSV files, and
  psql_copy.py loads the collections of CSV files into Postgres.
- Oplog load.
  Sharded clusters are supported; every MongoDB replica should be specified
  in the config file as a separate section. In this mode the oplog reader
  reads data starting from the synchronization point and transforms oplog
  entries into the appropriate SQL statements. Additionally, just before
  committing changes to Postgres, it performs a reverse load of the Postgres
  data and compares it to the actual MongoDB record's data. This creates
  extra overhead, but protects against inconsistent data ending up in the
  Postgres database. If an inconsistency is detected, the bad data is fixed
  by reloading the specific record; in the worst case this can lead to a
  complete reload of the whole database. In short, this strategy keeps the
  data consistent. Since the solution is batch-oriented, mongo_to_psql
  must be invoked every time freshly added MongoDB changes need to be
  processed. A simple crontab schedule can be used for this, so that every
  invocation of the app delivers fresh patches to PostgreSQL; for example,
  you can run this 'synchronization' tool every minute or every five minutes.
  The tool mongo_to_psql.py serves this stage.
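The reverse-load check described in the oplog phase can be sketched as follows. This is not gizer's actual code; the function names, the plain dict comparison, and the reload callback are illustrative assumptions about the general technique of verifying a rebuilt Postgres record against the live MongoDB document before commit.

```python
# Sketch of the consistency check: after oplog changes are applied, the
# record is rebuilt from Postgres and compared against the current MongoDB
# document. A mismatch triggers a reload of that one record instead of a
# commit. All names here are hypothetical, not gizer's real API.

def records_match(mongo_record: dict, rebuilt_record: dict) -> bool:
    """True when the record rebuilt from Postgres equals the Mongo doc."""
    return mongo_record == rebuilt_record

def verify_before_commit(mongo_record, rebuilt_record, reload_record):
    """Return True if it is safe to commit; otherwise schedule a reload."""
    if records_match(mongo_record, rebuilt_record):
        return True                          # safe to commit the batch
    reload_record(mongo_record["_id"])       # fix the inconsistent record
    return False
```

In the worst case (many mismatches) this degrades into the full-database reload the README mentions; the per-record reload is the cheap path.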
Environment
Config file.
Connection settings, etc. See sample-config.ini for inspiration.
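The authoritative section and key names live in sample-config.ini; the fragment below is only an illustrative sketch of the kinds of settings involved (two PostgreSQL connections plus per-replica MongoDB sections), with every section and key name assumed rather than copied from the actual file.

```ini
; Hypothetical sketch -- consult sample-config.ini for the real names.
[psql-target]        ; the real target PostgreSQL instance
host = localhost
port = 5432
dbname = target_db

[psql-cache]         ; the caching PostgreSQL instance
host = localhost
port = 5433
dbname = cache_db

[mongo-replica-1]    ; one section per MongoDB replica (sharded mode)
host = localhost
port = 27017
dbname = source_db
```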
Tools.
- mongo_reader.py - 1st part of the initial load.
  It creates a relational model of the data for a specific collection and then saves all of the collection's data into CSV files.
- psql_copy.py - 2nd part of the initial load.
  Loads the CSV files previously created by mongo_reader.py into Postgres tables.
- etlstatus.py - manages the ETL status on behalf of an external application;
- config_value.py - external config reader, one value per execution;
- mongo_to_psql.py - the syncing application; handles the oplog.
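The "relational model" step that mongo_reader.py is described as performing can be sketched in plain Python: a nested document becomes one row in the parent table plus rows in child tables, one child table per array field, linked back by the parent's `_id`. The table-naming scheme, column names, and helper functions below are assumptions for illustration, not gizer's actual code.

```python
# Sketch: flatten one MongoDB-style document into relational rows, then
# serialize a table's rows to CSV text (the shape psql_copy.py consumes).
import csv
import io

def flatten_document(doc, collection):
    """Split one document into {table_name: [row_dict, ...]}."""
    parent = {}
    tables = {collection: [parent]}
    for key, value in doc.items():
        if isinstance(value, list):
            # arrays become child tables linked by the parent's _id
            child = collection + "_" + key
            tables[child] = [
                {"parent_id": doc["_id"], "idx": i, "value": item}
                for i, item in enumerate(value)
            ]
        else:
            parent[key] = value
    return tables

def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Real nested sub-documents would recurse the same way; the sketch stops at one level to keep the idea visible.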
Command line examples
Acquire the schema.
Record the latest oplog timestamp before running the init load.
Run init load, part 1 of 2.
Run init load, part 2 of 2.
When the init load finishes, save the completion status (ok/error).
Verify the ETL status; if the exit code is 1, run the init load.
Run the following command every time to update the Postgres database with MongoDB data.
Schema items' types should be strictly defined; incorrectly defined types may lead to errors.
Fields which are not in the schema, or which have different types, will not be loaded into the relational model.
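The strict-typing rule above can be sketched as a filter applied before a record enters the relational model: fields missing from the schema, or whose values do not match the declared type, are dropped. The schema format and type names below are assumptions for illustration; gizer's real schema is the JSON produced by get_mongo_schema_as_json.py.

```python
# Sketch: drop fields that are absent from the schema or whose runtime
# type disagrees with the declared type. Hypothetical schema format.
TYPE_MAP = {"string": str, "int": int, "bool": bool, "float": float}

def filter_by_schema(doc, schema):
    """Keep only fields present in the schema with matching types."""
    kept = {}
    for field, declared in schema.items():
        if field not in doc:
            continue
        value = doc[field]
        # exact type check: bool is a subclass of int in Python, so
        # isinstance(True, int) would wrongly admit booleans as ints
        if type(value) is not TYPE_MAP[declared]:
            continue
        kept[field] = value
    return kept
```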