RADAR-base/Restructure-HDFS-topic

Name: Restructure-HDFS-topic

Owner: RADAR-CNS

Description: Reads avro files in HDFS and outputs json per topic per user in local file system

Created: 2017-04-13 08:49:55.0

Updated: 2017-08-03 18:18:40.0

Pushed: 2017-11-15 11:30:51.0

Homepage:

Size: 189

Language: Java

README

Restructure HDFS files

Data streamed to HDFS using the RADAR HDFS sink connector is written to files organized by sensor only. This package transforms that output into a local directory structure of the form userId/topic/date_hour.csv. The date and hour are extracted from the time field of each record and formatted in UTC.
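As a rough illustration of the layout described above, the following sketch derives a relative output path from a record's user ID, topic, and time field. The class and method names are hypothetical. Two details are assumptions that are not stated in this README: that the time field holds seconds since the Unix epoch, and that the date_hour part is written as yyyyMMdd_HH00.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the package's actual source.
class OutputPathSketch {
    // Assumed hour-granularity pattern, always rendered in UTC.
    private static final DateTimeFormatter HOUR_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMdd_HH'00'").withZone(ZoneOffset.UTC);

    // Builds userId/topic/date_hour.csv from a record's time field,
    // assumed here to be seconds since the Unix epoch.
    static String relativePath(String userId, String topic, double timeSeconds) {
        Instant instant = Instant.ofEpochSecond((long) Math.floor(timeSeconds));
        return userId + "/" + topic + "/" + HOUR_FORMAT.format(instant) + ".csv";
    }

    public static void main(String[] args) {
        // 1492073395 corresponds to 2017-04-13 08:49:55 UTC.
        System.out.println(relativePath("user-a", "android_phone_acceleration", 1492073395.0));
    }
}
```

Because the formatter is pinned to UTC, records end up in the same hourly file regardless of the time zone of the machine that runs the conversion.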

Usage

This package is included in the RADAR-Docker repository, in the dcompose/radar-cp-hadoop-stack/hdfs_restructure.sh script.

Advanced usage

Build jar from source with

./gradlew build

and find the output JAR file as build/libs/restructurehdfs-0.3.1-all.jar. Then run with:

java -jar restructurehdfs-0.3.1-all.jar <webhdfs_url> <hdfs_topic_path> <output_folder>

By default, this will output the data in CSV format. If JSON format is preferred, use the following instead:

java -Dorg.radarcns.format=json -jar restructurehdfs-0.3.1-all.jar <webhdfs_url> <hdfs_topic_path> <output_folder>
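Options given with -D before -jar reach the program as JVM system properties. The sketch below shows how such an option could be read with a CSV fallback. The class and method names are hypothetical, and the code is not taken from the package's source.

```java
import java.util.Locale;

// Hypothetical helper illustrating how a -D option can be read.
class FormatOptionSketch {
    // Returns the requested output format; falls back to "csv",
    // mirroring the default described in this README.
    static String outputFormat() {
        return System.getProperty("org.radarcns.format", "csv").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println("Output format: " + outputFormat());
    }
}
```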

Another option is to output the data in compressed form. All files will get the gz suffix, and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size.

java -Dorg.radarcns.compress=gzip -jar restructurehdfs-0.3.1-all.jar <webhdfs_url> <hdfs_topic_path> <output_folder>
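The size increase for very small files comes from the fixed GZIP header and trailer, which add at least 18 bytes to any input. The following sketch, using only the standard java.util.zip classes, compares a single short record with many repeated records. The class name and the sample record are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the package.
class GzipOverheadSketch {
    // Compresses a byte array in memory with GZIP.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up CSV record used only for this comparison.
        String record = "1492073395.0,0.12,-0.98,0.05\n";
        byte[] single = record.getBytes(StandardCharsets.UTF_8);

        StringBuilder many = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            many.append(record);
        }
        byte[] bulk = many.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("single record: " + single.length + " -> " + gzip(single).length + " bytes");
        System.out.println("1000 records:  " + bulk.length + " -> " + gzip(bulk).length + " bytes");
    }
}
```

Real sensor data is less repetitive than this sample, so actual compression ratios will be lower than the one printed for the repeated records.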

Finally, the records in each file are deduplicated after writing. To disable this behaviour, specify the option -Dorg.radarcns.deduplicate=false.
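One simple way to deduplicate records while keeping their original order is an insertion-ordered set. The sketch below assumes that each record is one line of text and that duplicates are exact line matches. The package's actual strategy may differ, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not the package's actual implementation.
class DeduplicateSketch {
    // Drops repeated lines; the first occurrence of each line keeps its position.
    static List<String> deduplicate(List<String> lines) {
        return new ArrayList<>(new LinkedHashSet<>(lines));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "time,x,y,z",
                "1492073395.0,0.12,-0.98,0.05",
                "1492073395.0,0.12,-0.98,0.05",
                "1492073396.0,0.10,-0.97,0.06");
        for (String line : deduplicate(lines)) {
            System.out.println(line);
        }
    }
}
```

This approach holds every distinct line in memory at once, which is acceptable for hourly files but would need rethinking for very large inputs.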
