Name: ratatool
Owner: Spotify
Description: A tool for data sampling, data generation, and data diffing
Created: 2016-08-01 17:33:25.0
Updated: 2018-05-23 18:31:25.0
Pushed: 2018-05-24 13:34:41.0
Size: 344
Language: Scala
A tool for random data sampling and generation
It includes ScalaCheck generators (Gen[T]) for property-based testing with Avro, Protocol Buffers and BigQuery TableRow.
If you use sbt, add the following dependency to your build file:
libraryDependencies += "com.spotify" %% "ratatool-scalacheck" % "0.3.2" % "test"
If needed, the following other libraries are published:
ratatool-diffy
ratatool-sampling
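If more than one module is needed, the dependencies can be declared together. The following is a minimal sketch, assuming that ratatool-diffy and ratatool-sampling are published under the same com.spotify group and the same 0.3.2 version as ratatool-scalacheck. The non-test scope for those two modules is also an assumption, so check the releases page for the actual coordinates.

```scala
// build.sbt (sketch; coordinates and scopes for diffy and sampling are assumptions)
libraryDependencies ++= Seq(
  // Generators are normally needed only in tests.
  "com.spotify" %% "ratatool-scalacheck" % "0.3.2" % "test",
  // Assumed to share the group id and version of the module above.
  "com.spotify" %% "ratatool-diffy" % "0.3.2",
  "com.spotify" %% "ratatool-sampling" % "0.3.2"
)
```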
Or install via our Homebrew tap if you're on a Mac:
brew tap spotify/public
brew install ratatool
ratatool
Or download the release jar and run it.
https://github.com/spotify/ratatool/releases/download/v0.3.2/ratatool-0.3.2.tar.gz
ratatool directSampler
The command line tool can sample directly from the local file system or from Google Cloud Storage, provided the Google Cloud SDK is installed and authenticated.
ratatool bigSampler avro --head -n 1000 --in gs://path/to/dataset --out out.avro
ratatool bigSampler parquet --head -n 1000 --in gs://path/to/dataset --out out.parquet
# Write output to both a JSON file and a BigQuery table
ratatool bigSampler bigquery --head -n 1000 --in project_id:dataset_id.table_id \
--out out.json --tableOut project_id:dataset_id.table_id
It can also be used to sample from HDFS if core-site.xml and hdfs-site.xml are available.
ratatool bigSampler avro \
--head -n 10 --in hdfs://namenode/path/to/dataset --out file:///path/to/out.avro
Or execute BigDiffy directly
ratatool bigDiffy --mode avro --key record.key \
--lhs gs://path/to/left --rhs gs://path/to/right --output gs://path/to/output \
--runner DataflowRunner ....
Copyright 2016-2018 Spotify AB.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0