spotify/ratatool

Name: ratatool

Owner: Spotify

Description: A tool for data sampling, data generation, and data diffing

Created: 2016-08-01 17:33:25.0

Updated: 2018-05-23 18:31:25.0

Pushed: 2018-05-24 13:34:41.0

Homepage:

Size: 344

Language: Scala

README

Ratatool

A tool for random data sampling and generation

Features

Usage

If you use sbt add the following dependency to your build file:

libraryDependencies += "com.spotify" %% "ratatool-scalacheck" % "0.3.2" % "test"

If needed, the following other libraries are published:

Or install via our Homebrew tap if you're on a Mac:

brew tap spotify/public
brew install ratatool
ratatool

Or download the release jar and run it.

wget https://github.com/spotify/ratatool/releases/download/v0.3.2/ratatool-0.3.2.tar.gz
ratatool directSampler

The command-line tool can sample from the local file system, or directly from Google Cloud Storage if the Google Cloud SDK is installed and authenticated.

ratatool bigSampler avro --head -n 1000 --in gs://path/to/dataset --out out.avro
ratatool bigSampler parquet --head -n 1000 --in gs://path/to/dataset --out out.parquet

# Write output to both a JSON file and a BigQuery table
ratatool bigSampler bigquery --head -n 1000 --in project_id:dataset_id.table_id \
--out out.json --tableOut project_id:dataset_id.table_id
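
The commands above that read from gs:// paths or BigQuery depend on the Google Cloud SDK being installed and authenticated. A minimal sketch of a preflight check follows. The helper names `have_cmd` and `gcloud_ready` are hypothetical and not part of Ratatool; the sketch only assumes the standard `gcloud auth list` command.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeeds if gcloud is installed and reports
# at least one active (authenticated) account.
gcloud_ready() {
    have_cmd gcloud || return 1
    [ -n "$(gcloud auth list --filter=status:ACTIVE --format='value(account)' 2>/dev/null)" ]
}

# Example use (not run here):
#   gcloud_ready || { echo "Run 'gcloud auth login' first" >&2; exit 1; }
```

Failing early this way gives a clearer message than an authentication error raised partway through a sampling run.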

It can also sample from HDFS if core-site.xml and hdfs-site.xml are available.

ratatool bigSampler avro \
--head -n 10 --in hdfs://namenode/path/to/dataset --out file:///path/to/out.avro
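
Since the HDFS example depends on core-site.xml and hdfs-site.xml being present, a small check can confirm both files exist before sampling. This is a sketch under assumptions: the helper `has_hdfs_conf` is hypothetical, and falling back to the conventional Hadoop variable `HADOOP_CONF_DIR` is an assumption about the environment, not documented Ratatool behavior.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if both Hadoop client config files
# exist in the given directory. Falls back to HADOOP_CONF_DIR (an
# assumption about the environment) when no argument is passed.
has_hdfs_conf() {
    conf_dir="${1:-${HADOOP_CONF_DIR:-}}"
    [ -n "$conf_dir" ] || return 1
    [ -f "$conf_dir/core-site.xml" ] && [ -f "$conf_dir/hdfs-site.xml" ]
}

# Example use (not run here; /etc/hadoop/conf is an illustrative path):
#   has_hdfs_conf /etc/hadoop/conf || echo "Hadoop config files not found" >&2
```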

Or execute BigDiffy directly

ratatool bigDiffy --mode avro --key record.key \
--lhs gs://path/to/left --rhs gs://path/to/right --output gs://path/to/output \
--runner DataflowRunner ....

License

Copyright 2016-2018 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

