proteus-h2020/proteus-producer

Name: proteus-producer

Owner: PROTEUS

Description: The kafka producer for the PROTEUS project :rocket: :rocket:

Created: 2016-12-14 14:19:40.0

Updated: 2018-01-15 14:33:57.0

Pushed: 2017-08-30 07:40:14.0

Homepage: http://proteus-bigdata.com

Size: 568

Language: Java

README

proteus-producer

A Kafka producer in charge of producing all the PROTEUS data so that it can be consumed by different actors. It is intended to simulate the current industry scenario at AMII (sequential generation of multiple coils). To achieve this, the producer uses three different topics:

- `proteus-realtime` for the real-time coil data
- `proteus-hsm` for the HSM record of each finished coil
- `proteus-flatness` for the flatness data of each coil

Getting started
Installing software dependencies

(see requirements section)

Moving files to HDFS

An important requirement before running the producer is to have the heterogeneous PROTEUS data (provided by e-mail to all the project partners) in an HDFS cluster (single-node deployments are also valid).

Use the following command to move your data (PROTEUS_HETEROGENEOUS_FILE.csv) to your HDFS:

 hdfs dfs -put <path_to_PROTEUS_HETEROGENEOUS_FILE.csv> /proteus/heterogeneous/final.csv

If you want to use a different HDFS location, configure the variable com.treelogic.proteus.hdfs.streamingPath in `src/main/resources/config.properties` before running the program.
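For example, an override of that entry would look like the following (the path shown here is illustrative, not a PROTEUS default):

```properties
# Hypothetical alternative HDFS location for the realtime dataset
com.treelogic.proteus.hdfs.streamingPath=/user/proteus/realtime/final.csv
```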

Since HSM data is also managed by this producer (when a coil has finished, its corresponding HSM record is produced on the proteus-hsm topic), you also need to move your HSM data to HDFS:

 hdfs dfs -put <path_to_HSM_subset.csv> /proteus/hsm/HSM_subset.csv

If you want to use a different HDFS location, configure the variable com.treelogic.proteus.hdfs.hsmPath in `src/main/resources/config.properties` before running the program.

The HSM_subset.csv file was also provided by e-mail to all the PROTEUS partners. This file (2 GB) is a subset of the original HSM data (40 GB), containing only those coils present in the real-time dataset (PROTEUS_HETEROGENEOUS_FILE.csv).

IMPORTANT If you need to use the HSM data for training and learning purposes, keep in mind that HSM_subset.csv is just a subset of the original HSM data.

Creating Kafka topics

You also need to create the aforementioned Kafka topics. You can use the following commands (by default we create one partition per topic; this should be improved in the future):

/kafka/bin/kafka-topics.sh --zookeeper <your_zookeeper_url>:2181 --create --topic proteus-realtime --partitions 1 --replication-factor 1

/kafka/bin/kafka-topics.sh --zookeeper <your_zookeeper_url>:2181 --create --topic proteus-hsm --partitions 1 --replication-factor 1

/kafka/bin/kafka-topics.sh --zookeeper <your_zookeeper_url>:2181 --create --topic proteus-flatness --partitions 1 --replication-factor 1
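Once created, a quick sanity check is to list the topics registered in Zookeeper (same placeholder as above) and confirm all three appear:

```shell
# List all topics known to the cluster; proteus-realtime, proteus-hsm
# and proteus-flatness should all appear in the output.
/kafka/bin/kafka-topics.sh --zookeeper <your_zookeeper_url>:2181 --list
```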

How to run it

You can run the Kafka producer in different ways. If you are using a terminal, run the following command:

mvn exec:java

If you want to run it in a production environment, the following command is recommended (it runs the producer as a background process):

mvn exec:java &
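A slightly more robust variant for background runs uses standard `nohup` with log redirection, so the producer survives the terminal closing (the log file name here is just an example):

```shell
# Keep the producer alive after the terminal closes and capture its output.
nohup mvn exec:java > producer.log 2>&1 &
echo "producer started with PID $!"
```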

If you want to import and run the project in your preferred IDE (e.g. Eclipse, IntelliJ), import the Maven project and execute the com.treelogic.proteus.Runner class.
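Alternatively, the same entry point can be launched explicitly from the command line via the Maven exec plugin (assuming exec-maven-plugin is configured in the project's pom.xml):

```shell
# Run the producer's entry point directly, naming the main class explicitly.
mvn compile exec:java -Dexec.mainClass=com.treelogic.proteus.Runner
```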

Configuration

The following shows the default configuration of the producer, specified in the `src/main/resources/config.properties` file:

 treelogic.proteus.hdfs.baseUrl=hdfs://192.168.4.245:8020 # Base URL of your HDFS
 treelogic.proteus.hdfs.streamingPath=/proteus/heterogeneous/final.csv # Path to realtime data
 treelogic.proteus.hdfs.hsmPath=/proteus/hsm/HSM_subset.csv # Path to HSM data

 treelogic.proteus.kafka.bootstrapServers=clusterIDI.slave01.treelogic.local:6667,clusterIDI.slave02.treelogic.local:6667,clusterIDI.slave03.treelogic.local:6667 # Bootstrap servers
 treelogic.proteus.kafka.topicName=proteus-realtime # Topic name of real-time data
 treelogic.proteus.kafka.flatnessTopicName=proteus-flatness # Topic name of flatness data
 treelogic.proteus.kafka.hsmTopicName=proteus-hsm # Topic name of HSM data

 treelogic.proteus.model.timeBetweenCoils=10000 # The time (in ms) that the program waits between generation of different coils
 treelogic.proteus.model.coilTime=120000 # The time (in ms) that the producer takes to produce a single coil
 treelogic.proteus.model.flatnessDelay=20000 # When a coil finishes, the program schedules its corresponding flatness generation with the delay indicated here

 treelogic.proteus.model.hsm.splitter=;

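To make the timing parameters concrete, a full coil cycle with the defaults above works out as follows (a quick sketch; the variable names mirror the config keys):

```shell
# Default timings from config.properties, in milliseconds.
coilTime=120000          # time to produce one coil
timeBetweenCoils=10000   # pause before the next coil starts
flatnessDelay=20000      # delay before a finished coil's flatness data is produced

# One full cycle: produce a coil, then wait before the next one starts.
cycleSeconds=$(( (coilTime + timeBetweenCoils) / 1000 ))
echo "seconds per coil cycle: $cycleSeconds"   # 130

# Flatness for a coil appears 20 s after that coil finishes.
flatnessSeconds=$(( flatnessDelay / 1000 ))
echo "flatness delay in seconds: $flatnessSeconds"   # 20
```

So with the defaults, a new coil starts roughly every 130 seconds, and each coil's flatness record trails its completion by 20 seconds.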
Software Requirements
Logs and monitoring

Immediately after running the program, two log files are created: one for Kafka and one for PROTEUS.

By default these files are created in the project's main directory (the same directory as the pom.xml), but you can customize this in src/main/resources/logback.xml. Both Kafka and PROTEUS logs are also printed to STDOUT.

