mozilla-services/mozilla-pipeline-schemas

Name: mozilla-pipeline-schemas

Owner: Mozilla Services

Created: 2015-12-23 22:33:03.0

Updated: 2018-05-23 18:13:10.0

Pushed: 2018-05-24 15:53:54.0

Size: 625

Language: Lua

README

Mozilla Pipeline Schemas

This repository contains schemas for Mozilla's data ingestion pipeline and data lake outputs.

The JSON schemas are used to validate incoming submissions at ingestion time. The RapidJSON library is used for JSON Schema validation. This has implications for which kinds of string patterns are supported; see the Conformance section in the linked document for further details.
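Because pattern support depends on the validator, it can help to list every regular expression a schema uses before checking them against the Conformance section. The following is a minimal sketch, not part of this repository: `collect_patterns` and `EXAMPLE_SCHEMA` are hypothetical names, and the example schema is invented for illustration. It only gathers the expressions; it does not decide whether RapidJSON accepts them.

```python
import json


def collect_patterns(node, path="#"):
    """Return (location, regex) pairs for every regex found in a schema.

    Collects values of the `pattern` keyword and keys of `patternProperties`.
    """
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}"
            if key == "pattern" and isinstance(value, str):
                found.append((child, value))
            elif key == "patternProperties" and isinstance(value, dict):
                # Keys of patternProperties are themselves regular expressions.
                for regex, subschema in value.items():
                    found.append((child, regex))
                    found.extend(collect_patterns(subschema, f"{child}/{regex}"))
            elif key in ("properties", "definitions") and isinstance(value, dict):
                # Keys here are names, not keywords, so a property that
                # happens to be called "pattern" is not reported.
                for name, subschema in value.items():
                    found.extend(collect_patterns(subschema, f"{child}/{name}"))
            else:
                found.extend(collect_patterns(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(collect_patterns(item, f"{path}/{index}"))
    return found


# Hypothetical draft 4 schema used only to exercise the function.
EXAMPLE_SCHEMA = json.loads("""
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "clientId": {"type": "string", "pattern": "^[0-9a-f]{8}$"},
    "pattern": {"type": "string"}
  },
  "patternProperties": {"^x_[a-z]+$": {"type": "integer"}}
}
""")

if __name__ == "__main__":
    for location, regex in collect_patterns(EXAMPLE_SCHEMA):
        print(location, regex)
```

The sketch treats any other nested object as a schema, so a `default` or `enum` value that contains a `pattern` key would also be reported; for a review aid that over-reporting is usually acceptable.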

To learn more about writing JSON Schemas, Understanding JSON Schema is a great resource.

The Parquet-MR schemas are used for direct-to-Parquet output. For examples of Parquet-MR schemas, see Parquet Schema Examples.

Adding a new schema

Build

Prerequisites

CMake Build Instructions

git clone https://github.com/mozilla-services/mozilla-pipeline-schemas.git
cd mozilla-pipeline-schemas
mkdir release
cd release

cmake ..  # this is the build process (the schemas are built with cmake templates)

Running Tests via Docker

The tests expect example pings to be in the validation/<namespace>/ subdirectory, with files named in the form <ping type>.<version>.<test name>.pass.json for documents expected to be valid, or <ping type>.<version>.<test name>.fail.json for documents expected to fail validation. The test name should match the pattern [0-9a-zA-Z_]+.
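The naming convention can be checked mechanically before a test file is committed. The sketch below is illustrative and is not the repository's test runner: `TEST_FILE` and `parse_test_path` are hypothetical names, the example paths are invented, and it assumes that neither the ping type nor the version contains a dot.

```python
import re
from pathlib import PurePosixPath

# <ping type>.<version>.<test name>.(pass|fail).json
# Assumption: ping type and version contain no "." characters.
TEST_FILE = re.compile(
    r"^(?P<ping_type>[^.]+)\.(?P<version>[^.]+)\.(?P<test_name>[0-9a-zA-Z_]+)"
    r"\.(?P<expected>pass|fail)\.json$"
)


def parse_test_path(path):
    """Split validation/<namespace>/<file> into its parts, or return None."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[-3] != "validation":
        return None
    match = TEST_FILE.match(parts[-1])
    if match is None:
        return None
    info = match.groupdict()
    info["namespace"] = parts[-2]
    # True for *.pass.json (expected valid), False for *.fail.json.
    info["should_validate"] = info.pop("expected") == "pass"
    return info


if __name__ == "__main__":
    for example in (
        "validation/telemetry/main.4.sample.pass.json",
        "validation/telemetry/main.4.missing_field.fail.json",
        "validation/telemetry/main.4.bad-name.pass.json",
    ):
        print(example, "->", parse_test_path(example))
```

A path that does not follow the convention, such as a test name containing a hyphen, yields `None`, which makes misnamed files easy to spot.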

To run the tests:

# build the container with the pipeline schemas
docker build -t mps .

# run the tests
docker run mps

Packaging and integration tests (optional)

Follow the CMake Build Instructions above, then:

cpack -G TGZ # (DEB|RPM|ZIP)

# Integration Tests (run on schema-test EC2 instance)
  # If running locally
    # The following RPMs must be installed:
      # luasandbox, hindsight, luasandbox-lfs, luasandbox-lpeg, luasandbox-rjson, luasandbox-cjson, luasandbox-parquet
    # The following external libraries must be installed:
      # parquet-cpp
make # this sets up the tests in the release directory
ctest -V -C hindsight # loads all the schemas and tests the inputs in the validation directory against them

Releases

Contributions

Notes

All schemas are generated from the 'templates' directory and written into the 'schemas' directory; that is, the generated artifacts are saved back into the repository. They are validated against the draft 4 schema, a copy of which resides in the 'tests' directory. The generated schemas are committed for two reasons:

  1. It lets us easily see and refer to complete schemas as they are actually used. This means that the schemas can be referenced directly in bugs and such, as well as being fetched directly from the repo for testing other schema consumers (test being important here, as any production use should be using the installable packages).
  2. It gives us a changelog for each schema, rather than having to reason about changes to templated external pieces and when/how that impacted a given doctype's schema over time. This means that it should be easy to look back in time for the provenance of different parts of the schema for each doctype.
