Name: Restructure-HDFS-topic
Owner: RADAR-CNS
Description: Reads avro files in HDFS and outputs json per topic per user in local file system
Created: 2017-04-13 08:49:55.0
Updated: 2017-08-03 18:18:40.0
Pushed: 2017-11-15 11:30:51.0
Size: 189
Language: Java
Data streamed to HDFS using the RADAR HDFS sink connector is written to files based on sensor only. This package can transform that output into a local directory structure of the form userId/topic/date_hour.csv. The date and hour are extracted from the time field of each record and are formatted in UTC.
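The date_hour bucketing can be sketched as follows. This is a minimal illustration using java.time, assuming the record time is given in epoch seconds; the class and method names here are invented, and the exact pattern the package uses may differ:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BucketName {
    // Hypothetical formatter: one file bucket per UTC date and hour, e.g. 20170413_08
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyyMMdd_HH").withZone(ZoneOffset.UTC);

    /** Derive the file-name bucket from a record time in epoch seconds. */
    public static String bucket(double timeEpochSeconds) {
        Instant instant = Instant.ofEpochMilli((long) (timeEpochSeconds * 1000.0));
        return HOURLY.format(instant);
    }

    public static void main(String[] args) {
        // 2017-04-13T08:49:55Z falls into the 08:00 UTC hour bucket
        System.out.println(bucket(1492073395.0)); // 20170413_08
    }
}
```

Formatting with an explicit UTC zone keeps bucket names stable regardless of the JVM's default time zone.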
This package is included in the RADAR-Docker repository, in the dcompose/radar-cp-hadoop-stack/hdfs_restructure.sh
script.
Build the JAR from source with
./gradlew build
and find the output JAR file as build/libs/restructurehdfs-0.3.1-all.jar. Then run with:
java -jar restructurehdfs-0.3.1-all.jar <webhdfs_url> <hdfs_topic_path> <output_folder>
By default, this will output the data in CSV format. If JSON format is preferred, use the following instead:
java -Dorg.radarcns.format=json -jar restructurehdfs-0.3.1-all.jar <webhdfs_url> <hdfs_topic_path> <output_folder>
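The difference between the two output formats can be illustrated with a toy flat record. The field names and rendering below are invented for illustration only; the package derives real field names from the Avro schema, and this naive JSON rendering does no proper escaping:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormatExample {
    /** Render a flat record as a CSV line (values only, no quoting). */
    public static String toCsv(Map<String, Object> record) {
        return record.values().stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
    }

    /** Render a flat record as a JSON object (naive, string/number values only). */
    public static String toJson(Map<String, Object> record) {
        return record.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":" + (e.getValue() instanceof String
                        ? "\"" + e.getValue() + "\"" : e.getValue()))
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("userId", "a");
        record.put("sourceId", "s1");
        System.out.println(toCsv(record));   // a,s1
        System.out.println(toJson(record));  // {"userId":"a","sourceId":"s1"}
    }
}
```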
Another option is to output the data in compressed form. All files will get the gz
suffix, and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size.
java -Dorg.radarcns.compress=gzip -jar restructurehdfs-0.3.1-all.jar <webhdfs_url> <hdfs_topic_path> <output_folder>
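The size caveat for very small files can be checked directly: GZIP adds a fixed header and trailer, so compressing only a few bytes produces a larger file. A standalone sketch, unrelated to the package's own I/O code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipOverhead {
    /** Compress bytes with GZIP and return the compressed length. */
    public static int gzippedSize(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        // A single tiny CSV record: GZIP's roughly 20 bytes of framing dominates
        byte[] tiny = "a,1\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(tiny.length + " bytes raw, "
                + gzippedSize(tiny) + " bytes gzipped");
    }
}
```

For files with many records, the deflate compression quickly outweighs this fixed framing cost.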
Finally, records in the files are deduplicated after they are written. To disable this behaviour, specify the option -Dorg.radarcns.deduplicate=false.
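Deduplication can be thought of as dropping repeated lines while keeping the first occurrence of each in its original order. A simplified model; the package's actual criteria for what counts as a duplicate may consider the full record contents:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class Deduplicate {
    /** Remove duplicate lines, keeping first occurrences in their original order. */
    public static List<String> dedupe(List<String> lines) {
        // LinkedHashSet discards repeats while preserving insertion order
        return new ArrayList<>(new LinkedHashSet<>(lines));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a,1", "b,2", "a,1", "c,3");
        System.out.println(dedupe(lines)); // [a,1, b,2, c,3]
    }
}
```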