Name: gizer
Owner: racker
Description: ETL engine for MongoDB, prerequisite for racker/caspian-data-access
Created: 2016-04-05 19:29:42.0
Updated: 2017-02-20 20:53:08.0
Pushed: 2017-06-28 18:15:47.0
Homepage:
Size: 605
Language: Python
GitHub Committers
User | Most Recent Commit | # Commits |
Other Committers
User | Email | Most Recent Commit | # Commits |
README
Intro
The application requires two connections to PostgreSQL instances: one is
used for caching purposes and the other is the real target instance. At
least PostgreSQL 9.3 is required. This requirement comes from the initial
database synchronization (init load).
The solution is divided into three phases:
- Schema acquisition.
  See the get_mongo_schema_as_json.py tool from
  https://github.com/YaroslavLitvinov/mongo_schema. That tool
  creates the schema of the data to be used during the init and oplog
  load stages.
- Init load.
  This is performed by bulk psql COPY. During this phase data
  is read directly from the mongo connection, so no oplog connection is
  used at this stage. It can be useful to run only this stage if
  you just need to update your Postgres data once a day.
  mongo_reader.py and psql_copy.py are the tools for this stage:
  mongo_reader.py reads data from MongoDB into CSV files, and
  psql_copy.py loads the collections of CSV files into Postgres.
- Oplog load.
  Sharded clusters are supported; every MongoDB replica should be specified
  in the config file as a separate section. In this mode the oplog reader
  reads data starting from the synchronization point and transforms oplog
  entries into the appropriate SQL statements. Additionally, just before
  committing changes to Postgres, it performs a reverse load of the Postgres
  data and compares it to the actual MongoDB record's data. This creates
  extra overhead, but protects against inconsistent data ending up in the
  Postgres database. If an inconsistency is detected, the bad data is fixed
  by reloading the specific record; in the worst case this can lead to a
  complete reload of the whole database. In short, this strategy keeps the
  data consistent. Since the solution is batch-oriented, mongo_to_psql
  must be invoked every time freshly added MongoDB changes need to be
  processed. A simple crontab schedule can be used for this, so that every
  invocation of the app delivers fresh patches to PostgreSQL; for example,
  you can run this 'synchronization' tool every minute or every five minutes.
  The tool mongo_to_psql.py serves this stage.
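The reverse-load check described in the oplog phase can be sketched as follows. This is not gizer's actual code; the function names, the plain dict comparison, and the reload callback are illustrative assumptions about the general technique of verifying a rebuilt Postgres record against the live MongoDB document before commit.

```python
# Sketch of the consistency check: after oplog changes are applied, the
# record is rebuilt from Postgres and compared against the current MongoDB
# document. A mismatch triggers a reload of that one record instead of a
# commit. All names here are hypothetical, not gizer's real API.

def records_match(mongo_record: dict, rebuilt_record: dict) -> bool:
    """True when the record rebuilt from Postgres equals the Mongo doc."""
    return mongo_record == rebuilt_record

def verify_before_commit(mongo_record, rebuilt_record, reload_record):
    """Return True if it is safe to commit; otherwise schedule a reload."""
    if records_match(mongo_record, rebuilt_record):
        return True                          # safe to commit the batch
    reload_record(mongo_record["_id"])       # fix the inconsistent record
    return False
```

In the worst case (many mismatches) this degrades into the full-database reload the README mentions; the per-record reload is the cheap path.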
Environment
Config file.
Connection settings, etc. See sample-config.ini for inspiration.
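The authoritative section and key names live in sample-config.ini; the fragment below is only an illustrative sketch of the kinds of settings involved (two PostgreSQL connections plus per-replica MongoDB sections), with every section and key name assumed rather than copied from the actual file.

```ini
; Hypothetical sketch -- consult sample-config.ini for the real names.
[psql-target]        ; the real target PostgreSQL instance
host = localhost
port = 5432
dbname = target_db

[psql-cache]         ; the caching PostgreSQL instance
host = localhost
port = 5433
dbname = cache_db

[mongo-replica-1]    ; one section per MongoDB replica (sharded mode)
host = localhost
port = 27017
dbname = source_db
```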
Tools.
- mongo_reader.py - 1st part of the initial load.
  It creates a relational model of the data for a specific collection and then saves all of the collection's data into CSV files.
- psql_copy.py - 2nd part of the initial load.
  Loads the CSV files previously created by mongo_reader.py into Postgres tables.
- etlstatus.py - manages the ETL status on behalf of an external application;
- config_value.py - external config reader, one value per execution;
- mongo_to_psql.py - the syncing application; handles the oplog.
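The "relational model" step that mongo_reader.py is described as performing can be sketched in plain Python: a nested document becomes one row in the parent table plus rows in child tables, one child table per array field, linked back by the parent's `_id`. The table-naming scheme, column names, and helper functions below are assumptions for illustration, not gizer's actual code.

```python
# Sketch: flatten one MongoDB-style document into relational rows, then
# serialize a table's rows to CSV text (the shape psql_copy.py consumes).
import csv
import io

def flatten_document(doc, collection):
    """Split one document into {table_name: [row_dict, ...]}."""
    parent = {}
    tables = {collection: [parent]}
    for key, value in doc.items():
        if isinstance(value, list):
            # arrays become child tables linked by the parent's _id
            child = collection + "_" + key
            tables[child] = [
                {"parent_id": doc["_id"], "idx": i, "value": item}
                for i, item in enumerate(value)
            ]
        else:
            parent[key] = value
    return tables

def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Real nested sub-documents would recurse the same way; the sketch stops at one level to keep the idea visible.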
Command line examples
Acquire the schema.
Record the latest oplog timestamp before running the init load.
Run init load, part 1 of 2.
Run init load, part 2 of 2.
When the init load finishes, save the completion status (ok/error).
Verify the ETL status; if the exit code is 1, run the init load.
Run the following command every time to update the Postgres database with MongoDB data.
Schema items' types should be strictly defined; incorrectly defined types may lead to errors.
Fields which are not in the schema, or which have different types, will not be loaded into the relational model.
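The strict-typing rule above can be sketched as a filter applied before a record enters the relational model: fields missing from the schema, or whose values do not match the declared type, are dropped. The schema format and type names below are assumptions for illustration; gizer's real schema is the JSON produced by get_mongo_schema_as_json.py.

```python
# Sketch: drop fields that are absent from the schema or whose runtime
# type disagrees with the declared type. Hypothetical schema format.
TYPE_MAP = {"string": str, "int": int, "bool": bool, "float": float}

def filter_by_schema(doc, schema):
    """Keep only fields present in the schema with matching types."""
    kept = {}
    for field, declared in schema.items():
        if field not in doc:
            continue
        value = doc[field]
        # exact type check: bool is a subclass of int in Python, so
        # isinstance(True, int) would wrongly admit booleans as ints
        if type(value) is not TYPE_MAP[declared]:
            continue
        kept[field] = value
    return kept
```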