allegro/camus-compressor

Name: camus-compressor

Owner: Allegro Tech

Description: Camus Compressor merges files created by Camus and saves them in a compressed format.

Created: 2015-06-18 12:15:30.0

Updated: 2018-02-20 19:16:03.0

Pushed: 2017-03-21 11:52:17.0

Homepage:

Size: 193

Language: Java

README

Camus Compressor

Camus Compressor merges files created by Camus and saves them in a compressed format.

Camus is used heavily at Allegro for dumping more than 200 Kafka topics onto HDFS. The job runs every 15 minutes and creates one file per Kafka partition, which results in about 76800 small files per day. Most of these files do not exceed the Hadoop block size. This is a clear Hadoop antipattern that leads to performance issues, for example an excessive number of mappers in SQL query executions.

Camus Compressor solves this issue by merging the files within a Hive partition and compressing them. It does not change the Camus directory structure and supports daily and hourly partitioning well. The tool runs on YARN and is built on Spark.
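
For illustration only: compressing a single daily partition replaces the small files inside the partition directory while the directory itself stays where Camus wrote it. The path and topic name below are hypothetical, not taken from this repository.

# Hypothetical Camus daily partition; path and topic name are illustrative only.
hdfs dfs -ls /camus/topics/clickstream/daily/2015/06/18   # before: many small files
# ... run Camus Compressor on this partition ...
hdfs dfs -ls /camus/topics/clickstream/daily/2015/06/18   # after: a few large compressed files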

Supported compressors

How to use

As mentioned above, you need Spark to run Camus Compressor. The provided src/main/resources/compressor.sh script helps with executing spark-submit commands by setting the following options:
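
A rough sketch of the kind of spark-submit invocation such a wrapper produces is shown below; the main class placeholder, the jar name, and the way the properties file is passed are assumptions rather than details copied from the script.

# Rough sketch of the spark-submit call that compressor.sh builds.
# Values mirror the sample usage below; class and jar names are assumed.
"$SPARK_SUBMIT" \
    --master yarn-cluster \
    --queue default \
    --num-executors 10 \
    --driver-memory 4g \
    --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
    --class <compressor main class> \
    camus-compressor-all.jar \
    /etc/camus-compressor/camus-compressor.properties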

Configuration file

In the configuration file (/etc/camus-compressor/camus-compressor.properties) you can set the following options:
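
Purely to illustrate the Java properties format, such a file could look as follows; every key shown is a hypothetical placeholder rather than one of the tool's actual option names.

# /etc/camus-compressor/camus-compressor.properties -- format illustration only;
# all keys below are hypothetical placeholders.
example.input.path=/camus/topics
example.compression.codec=snappy
example.partitioning=daily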

How to build

Camus Compressor is shipped as a fat JAR or a Debian package and is built using Gradle. To build the fat JAR (output in build/libs/), run:

./gradlew shadowJar

To build the camus-compressor Debian package (output in build/), run:

./gradlew shadowJar prepareControlFile buildDeb

Sample usage

Before executing the compressor.sh script, make sure that the SPARK_SUBMIT variable is set to the location of spark-submit from Spark version 1.6.1 or above.

export SPARK_SUBMIT="/usr/bin/spark-submit"
compressor.sh -P /etc/camus-compressor/camus-compressor.properties \
    -e 10 \
    -q default \
    -d 4g \
    -m yarn-cluster \
    -c "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
License

Copyright 2015 Allegro Group

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

