Name: mozilla-pipeline-schemas
Owner: Mozilla Services
Description: null
Created: 2015-12-23 22:33:03.0
Updated: 2018-05-23 18:13:10.0
Pushed: 2018-05-24 15:53:54.0
Homepage: null
Size: 625
Language: Lua
This repository contains schemas for Mozilla's data ingestion pipeline and data lake outputs.
The JSON schemas are used to validate incoming submissions at ingestion time.
The RapidJSON library is used for JSON Schema validation. This has implications for which kinds of string patterns are supported; see the Conformance section of the RapidJSON schema documentation for further details.
To learn more about writing JSON Schemas, Understanding JSON Schema is a great resource.
The Parquet-MR schemas are used for direct-to-Parquet output; some examples of Parquet-MR schemas can be found in Parquet Schema Examples.
To add a new schema:
- Create the JSON Schema in the templates directory first. Make use of common schema components from the templates/include directory where possible, including things like the telemetry environment, clientId, application block, or UUID patterns. The filename should be templates/<namespace>/<doctype>/<doctype>.<version>.schema.json.
- The Parquet-MR schema filename should be templates/<namespace>/<doctype>/<doctype>.<version>.parquetmr.txt.
- Build the rendered schemas and check those artifacts (in the schemas directory) in to the git repo as well. See the rationale for this in the “Notes” section below.
- Add example documents to the validation directory.
- Submit the change against the dev branch.
CMake Build Instructions
git clone https://github.com/mozilla-services/mozilla-pipeline-schemas.git
cd mozilla-pipeline-schemas
mkdir release
cd release
cmake .. # this is the build process (the schemas are built with cmake templates)
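Before building, it can help to confirm that the template files for a new schema sit where the naming convention above expects them. The helper below is a minimal sketch in POSIX shell. The function name template_paths and the sample namespace, doctype, and version are hypothetical and are not part of this repository's tooling.

```shell
# template_paths NAMESPACE DOCTYPE VERSION
# Hypothetical helper (not part of this repository): prints the two template
# file names that the naming convention implies, one per line.
template_paths() {
    base="templates/$1/$2/$2.$3"
    printf '%s\n' "$base.schema.json" "$base.parquetmr.txt"
}

# Example: report any expected template that does not exist yet.
# "telemetry", "main", and "4" are placeholder values for illustration.
template_paths telemetry main 4 | while IFS= read -r path; do
    [ -f "$path" ] || echo "missing: $path"
done
```

Whether every document type needs a Parquet-MR schema is not stated here, so a missing .parquetmr.txt file is best treated as a prompt to check rather than as an error.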
The tests expect example pings to be in the validation/<namespace>/ subdirectory, with files named in the form <ping type>.<version>.<test name>.pass.json for documents expected to be valid, or <ping type>.<version>.<test name>.fail.json for documents expected to fail validation. The test name should match the pattern [0-9a-zA-Z_]+.
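The file naming rule above can be checked mechanically before running the tests. The following is a minimal sketch in POSIX shell, not the repository's own test harness. The function name classify_example is hypothetical, and the sketch assumes that the version and test name contain no dots, so everything before the last two dot-separated fields is treated as the ping type.

```shell
# classify_example FILE
# Hypothetical helper (not part of this repository). Prints "pass" or "fail"
# for a file name of the form <ping type>.<version>.<test name>.(pass|fail).json
# and returns non-zero, printing nothing, for any other name.
classify_example() {
    name=${1##*/}                        # strip any leading directories
    case $name in
        *.pass.json) expected=pass ;;
        *.fail.json) expected=fail ;;
        *) return 1 ;;
    esac
    stem=${name%.$expected.json}         # <ping type>.<version>.<test name>
    test_name=${stem##*.}                # text after the last dot
    rest=${stem%.*}                      # <ping type>.<version>
    version=${rest##*.}
    ping_type=${rest%.*}
    # With fewer than three dot-separated fields, one of these stays unchanged.
    [ "$rest" != "$stem" ] && [ "$ping_type" != "$rest" ] || return 1
    [ -n "$ping_type" ] && [ -n "$version" ] || return 1
    # The test name must be non-empty and match [0-9a-zA-Z_]+.
    case $test_name in
        ''|*[!0-9a-zA-Z_]*) return 1 ;;
    esac
    echo "$expected"
}
```

For a placeholder path such as validation/telemetry/main.4.sample.pass.json, classify_example prints pass. A name such as main.4.bad-name.pass.json is rejected because the hyphen falls outside the allowed test name pattern.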
To run the tests:
# build the container with the pipeline schemas
docker build -t mps .
# run the tests
docker run mps
Follow the CMake Build Instructions above, then:
cpack -G TGZ # (DEB|RPM|ZIP)
# Integration Tests (run on schema-test EC2 instance)
# If running locally
# The following RPMs must be installed:
# luasandbox, hindsight, luasandbox-lfs, luasandbox-lpeg, luasandbox-rjson, luasandbox-cjson, luasandbox-parquet
# The following external libraries must be installed
# parquet-cpp
make # this sets up the tests in the release directory
ctest -V -C hindsight # loads all the schemas and tests the inputs in the validation directory against them
Changes go through the dev branch; direct commits to master are not permitted.
Notes
All schemas are generated from the 'templates' directory and written into the 'schemas' directory (i.e., the artifacts are generated and saved back into the repository) and validated against the draft 4 schema, a copy of which resides in the 'tests' directory. The reason for this is twofold: