confluentinc/avro-random-generator

Name: avro-random-generator

Owner: Confluent Inc.

Description: Used to generate mock Avro data

Created: 2018-04-26 21:16:54.0

Updated: 2018-05-24 03:00:12.0

Pushed: 2018-05-24 10:09:05.0

Size: 166

Language: Java

README

Arg: Avro Random Generator

NOTE: Building is required to run the program.

What does it do?
The boring stuff

Arg reads a schema through either stdin or a CLI-specified file and generates random data to fit it.

Arg can output data in either JSON or binary format, and when outputting in JSON, can either print in compact format (one instance of spoofed data per line) or pretty format.

Arg can output data either to stdout or a file. After outputting all of its spoofed data, Arg prints a single newline.

The number of instances of spoofed data can also be specified; the default is currently 1.

The cool stuff

Arg also allows special annotations in the Avro schema it spoofs that narrow down the kind of data produced. For example, when spoofing a string, you can currently specify a length for the string (either an exact value, or a minimum, a maximum, or both), a list of possible strings to choose from, or a regular expression that the string should match.

These annotations are specified inside the schema that Arg spoofs, as JSON properties grouped in an object under the attribute name "arg.properties". They should not collide with any existing properties, or cause any issues if present when the schema is used with other programs.
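To make the length annotation more concrete, here is a minimal Java sketch of how a generator might turn a length value (an exact number, or a min/max pair) into a random string. It is not Arg's actual implementation: the class LengthSketch, its method names, the lowercase alphabet, and the choice to treat min as inclusive and max as exclusive are all assumptions made for illustration.

```java
import java.util.Random;

final class LengthSketch {
    // Hypothetical alphabet; a real generator may draw from a wider character set.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // Exact-length case, e.g. "length": 10
    static String randomString(Random random, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Range case, e.g. "length": {"min": 5, "max": 16}
    // Assumption: min is inclusive and max is exclusive.
    static String randomString(Random random, int min, int max) {
        if (min < 0 || max <= min) {
            throw new IllegalArgumentException("need 0 <= min < max");
        }
        int length = min + random.nextInt(max - min);
        return randomString(random, length);
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(randomString(random, 10));
        System.out.println(randomString(random, 5, 16));
    }
}
```

Keeping the exact-length and range cases as two overloads mirrors the two shapes the annotation can take in the schema: a JSON number or a JSON object.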

Building

gradlew standalone

CLI Usage

$ ./arg -?
arg: Generate random Avro data
Usage: arg [-f <file> | -s <schema>] [-j | -b] [-p | -c] [-i <i>] [-o <file>]

Flags:
    -?, -h, --help: Print a brief usage summary and exit with status 0
    -b, --binary:   Encode outputted data in binary format
    -c, --compact:  Output each record on a single line of its own (has no effect if encoding is not JSON)
    -f <file>, --schema-file <file>:    Read the schema to spoof from <file>, or stdin if <file> is '-' (default is '-')
    -i <i>, --iterations <i>:   Output <i> iterations of spoofed data (default is 1)
    -j, --json: Encode outputted data in JSON format (default)
    -o <file>, --output <file>: Write data to the file <file>, or stdout if <file> is '-' (default is '-')
    -p, --pretty:   Output each record in prettified format (has no effect if encoding is not JSON) (default)
    -s <schema>, --schema <schema>: Spoof the schema <schema>

Source repository:
https://github.com/confluentinc/avro-random-generator
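
As an illustration of how the flags above combine, the following Java sketch assembles the argument list for one possible run: five compact JSON records generated from a schema file and written to stdout. The class ArgCommandSketch and the method buildCommand are hypothetical helpers, not part of Arg; only the flags and the ./arg launcher come from the usage text above. The ProcessBuilder call is left commented out because it needs a built ./arg in the working directory.

```java
import java.util.ArrayList;
import java.util.List;

final class ArgCommandSketch {
    // Builds a command line using only flags listed in the usage summary.
    static List<String> buildCommand(String schemaFile, int iterations,
                                     boolean compact, String outputFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("./arg");
        cmd.add("-f");
        cmd.add(schemaFile);                      // schema file, or "-" for stdin
        cmd.add("-i");
        cmd.add(Integer.toString(iterations));    // number of records to generate
        cmd.add("-j");                            // JSON encoding (the default)
        cmd.add(compact ? "-c" : "-p");           // compact or pretty JSON
        cmd.add("-o");
        cmd.add(outputFile);                      // output file, or "-" for stdout
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("test/schemas/enum.json", 5, true, "-");
        System.out.println(String.join(" ", cmd));
        // To actually launch Arg (requires a built ./arg in the working directory):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```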
Schema annotations
Annotation types

The following annotations are currently supported:

The following schemas support the following annotations:

Primitives
null boolean int long float double bytes string

*Note: If both length and regex are specified for a string, the length property (if it is a JSON number) becomes a minimum length for the string.

Complex
array enum fixed map record union
Example schemas

Example schemas are provided in the test/schemas directory. Here are a few of them:

enum.json

{
  "name": "enum_comp",
  "type": "enum",
  "symbols": ["PRELUDE", "ALLEMANDE", "COURANTE", "SARABANDE", "MINUET", "BOURREE", "GAVOTTE", "GIGUE"]
}

A non-annotated schema. The resulting output will just be a random enum chosen from the symbols list.
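
The behaviour described for this schema can be sketched in a few lines of Java. The sketch assumes a uniform choice among the symbols, which the README does not state explicitly; the class EnumPickSketch and its pick method are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class EnumPickSketch {
    // The symbols list from enum.json.
    static final List<String> SYMBOLS = Arrays.asList(
            "PRELUDE", "ALLEMANDE", "COURANTE", "SARABANDE",
            "MINUET", "BOURREE", "GAVOTTE", "GIGUE");

    // Assumption: every symbol is equally likely.
    static String pick(Random random, List<String> symbols) {
        return symbols.get(random.nextInt(symbols.size()));
    }

    public static void main(String[] args) {
        System.out.println(pick(new Random(), SYMBOLS));
    }
}
```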

regex.json

{
  "type": "record",
  "name": "regex_test",
  "fields":
    [
      {
        "name": "no_length_property",
        "type":
          {
            "type": "string",
            "arg.properties": {
              "regex": "[a-zA-Z]{5,15}"
            }
          }
      },
      {
        "name": "number_length_property",
        "type":
          {
            "type": "string",
            "arg.properties": {
              "regex": "[a-zA-Z]*",
              "length": 10
            }
          }
      },
      {
        "name": "min_length_property",
        "type":
          {
            "type": "string",
            "arg.properties": {
              "regex": "[a-zA-Z]{0,15}",
              "length":
                {
                  "min": 5
                }
            }
          }
      },
      {
        "name": "max_length_property",
        "type":
          {
            "type": "string",
            "arg.properties": {
              "regex": "[a-zA-Z]{5,}",
              "length":
                {
                  "max": 16
                }
            }
          }
      },
      {
        "name": "min_max_length_property",
        "type":
          {
            "type": "string",
            "arg.properties": {
              "regex": "[a-zA-Z]*",
              "length":
                {
                  "min": 5,
                  "max": 16
                }
            }
          }
      }
    ]
}

An annotated record schema, with a variety of string fields. Each field has its own way of preventing the specified string from becoming too long, either via the length annotation or the regex annotation.
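
One way to reason about how regex and length interact in these fields is to write down the check a generated string would have to pass. The Java sketch below does that with java.util.regex.Pattern. It is a validator written for illustration, not code from Arg: the class RegexLengthCheck, the conforms method, and the treatment of min as inclusive and max as exclusive are assumptions. A null bound means that side is unbounded.

```java
import java.util.regex.Pattern;

final class RegexLengthCheck {
    // Returns true if value matches the whole regex and respects the bounds.
    // Assumption: min is inclusive, max is exclusive; null means "no bound".
    static boolean conforms(String value, String regex, Integer min, Integer max) {
        if (!Pattern.matches(regex, value)) {
            return false;
        }
        if (min != null && value.length() < min) {
            return false;
        }
        if (max != null && value.length() >= max) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // no_length_property: the regex alone bounds the length to 5..15.
        System.out.println(conforms("abcde", "[a-zA-Z]{5,15}", null, null));
        // number_length_property: a numeric length acts as a minimum
        // when a regex is also present.
        System.out.println(conforms("HelloWorld", "[a-zA-Z]*", 10, null));
        System.out.println(conforms("Hello", "[a-zA-Z]*", 10, null));
    }
}
```

The second and third calls reflect the note on strings above: with a regex present, a numeric length of 10 is treated as a minimum, so a ten-letter string passes and a five-letter string does not.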

options-file.json

{
  "type": "record",
  "name": "sentence",
  "fields": [
    {
      "name": "The",
      "type": {
        "type": "string",
        "arg.properties": {
          "options": [
            "The"
          ]
        }
      }
    },
    {
      "name": "noun",
      "type": {
        "type": "string",
        "arg.properties": {
          "options": {
            "file": "test/schemas/nouns-list.json",
            "encoding": "json"
          }
        }
      }
    },
    {
      "name": "is",
      "type": {
        "type": "string",
        "arg.properties": {
          "options": [
            "is",
            "was",
            "will be",
            "is being",
            "was being",
            "has been",
            "had been",
            "will have been"
          ]
        }
      }
    },
    {
      "name": "degree",
      "type": {
        "type": "string",
        "arg.properties": {
          "options": [
            "not at all",
            "slightly",
            "somewhat",
            "kind of",
            "pretty",
            "very",
            "entirely"
          ]
        }
      }
    },
    {
      "name": "adjective",
      "type": {
        "type": "string",
        "arg.properties": {
          "options": {
            "file": "test/schemas/adjectives-list.json",
            "encoding": "json"
          }
        }
      }
    }
  ]
}


A record schema that draws its content from two files, 'nouns-list.json' and 'adjectives-list.json', to construct a primitive sentence. The script must be run from the repository base directory for this schema to work properly, due to the relative paths of the files.
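
The working-directory requirement follows from how relative paths resolve. The Java sketch below shows the same relative path landing in different places depending on the base directory. It assumes Arg resolves the "file" value against the process working directory, which the paragraph above implies but does not spell out; the directory /repo and the class RelativePathSketch are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class RelativePathSketch {
    // Resolves a relative path against a given base directory.
    static Path resolve(String baseDir, String relative) {
        return Paths.get(baseDir).resolve(relative).normalize();
    }

    public static void main(String[] args) {
        String file = "test/schemas/nouns-list.json";
        // From the (hypothetical) repository root the path points at the real file.
        System.out.println(resolve("/repo", file));
        // From a subdirectory it points somewhere that most likely does not exist.
        System.out.println(resolve("/repo/test", file));
        // What the current process would see for the same relative path:
        Path fromCwd = Paths.get(file).toAbsolutePath();
        System.out.println(fromCwd + " exists: " + Files.exists(fromCwd));
    }
}
```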

options.json

{
  "type": "record",
  "name": "options_test_record",
  "fields": [
    {
      "name": "array_field",
      "type": {
        "type": "array",
        "items": "string",
        "arg.properties":
          {
            "options": [
              [
                "Hello",
                "world"
              ],
              [
                "Goodbye",
                "world"
              ],
              [
                "We",
                "meet",
                "again",
                "world"
              ]
            ]
          }
      }
    },
    {
      "name": "enum_field",
      "type": {
        "type": "enum",
        "name": "enum_test",
        "symbols": [
          "HELLO",
          "HI_THERE",
          "GREETINGS",
          "SALUTATIONS",
          "GOODBYE"
        ],
        "arg.properties":
          {
            "options": [
              "HELLO",
              "SALUTATIONS"
            ]
          }
      }
    },
    {
      "name": "fixed_field",
      "type": {
        "type": "fixed",
        "name": "fixed_test",
        "size": 2,
        "arg.properties":
          {
            "options": [
              "\u0034\u0032",
              "\u0045\u0045"
            ]
          }
      }
    },
    {
      "name": "map_field",
      "type": {
        "type": "map",
        "values": "int",
        "arg.properties":
          {
            "options": [
              {
                "zero": 0
              },
              {
                "one": 1,
                "two": 2
              },
              {
                "three": 3,
                "four": 4,
                "five": 5
              },
              {
                "six": 6,
                "seven": 7,
                "eight": 8,
                "nine": 9
              }
            ]
          }
      }
    },
    {
      "name": "map_key_field",
      "type": {
        "type": "map",
        "values": {
          "type": "int",
          "arg.properties": {
            "options": [
              -1,
              0,
              1
            ]
          }
        },
        "arg.properties": {
          "length": 10,
          "keys": {
            "options": [
              "negative",
              "zero",
              "positive"
            ]
          }
        }
      }
    },
    {
      "name": "record_field",
      "type": {
        "type": "record",
        "name": "record_test",
        "fields": [
          {
            "name": "month",
            "type": "string"
          },
          {
            "name": "day",
            "type": "int"
          }
        ],
        "arg.properties": {
          "options": [
            {
              "month": "January",
              "day": 2
            },
            {
              "month": "NANuary",
              "day": 0
            }
          ]
        }
      }
    },
    {
      "name": "union_field",
      "type": [
        "null",
        {
          "type": "boolean",
          "arg.properties": {
            "options": [
              true
            ]
          }
        },
        {
          "type": "int",
          "arg.properties": {
            "options": [
              42
            ]
          }
        },
        {
          "type": "long",
          "arg.properties": {
            "options": [
              4242424242424242
            ]
          }
        },
        {
          "type": "float",
          "arg.properties": {
            "options": [
              42.42
            ]
          }
        },
        {
          "type": "double",
          "arg.properties": {
            "options": [
              42424242.42424242
            ]
          }
        },
        {
          "type": "bytes",
          "arg.properties": {
            "options": [
              "NDI="
            ]
          }
        },
        {
          "type": "string",
          "arg.properties": {
            "options": [
              "Forty-two"
            ]
          }
        }
      ]
    }
  ]
}


A schema where every field is annotated with an example usage of the options annotation, as well as an example of the keys annotation.
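
The bytes and fixed options in this schema are easier to read once decoded. The Java sketch below shows that the string NDI= is valid Base64 for the two ASCII characters 42, and that the two escaped fixed options spell 42 and EE, each exactly two characters long to match the declared size of 2. Whether Arg itself interprets bytes options as Base64 is an assumption here; the code only demonstrates the encodings. The class OptionDecodingSketch is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class OptionDecodingSketch {
    // Decodes a Base64 string and interprets the resulting bytes as ASCII text.
    static String decodeBase64Ascii(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The bytes option from union_field.
        System.out.println(decodeBase64Ascii("NDI="));
        // The two fixed_field options, written with the same escapes as the schema.
        String first = "\u0034\u0032";
        String second = "\u0045\u0045";
        System.out.println(first + " (length " + first.length() + ")");
        System.out.println(second + " (length " + second.length() + ")");
    }
}
```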

