Name: ld4p-data-pipeline
Owner: sul-dlss-labs (Stanford University Digital Library)
Description: Scala/Kafka/Spark Linked Data Pipeline
Created: 2017-08-01 22:59:01
Updated: 2017-10-01 01:58:03
Pushed: 2017-11-17 22:03:42
Size: 211
Language: XSLT
Project Management : https://ld-stanford.myjetbrains.com/youtrack/
Check the Wiki Pages for a comprehensive introduction to the project (this is in progress).
Dependencies:

- Scala: 2.11.11
- Scala Build Tool (sbt)
- Kafka: 0.11.x
- Spark: 2.2.0, pre-built for Apache Hadoop 2.7
- BananaRDF (see note below on installing)

Unfortunately the BananaRDF artifacts for the version of Scala we use are not available online, so banana-rdf needs to be cloned, compiled, and published locally before compiling this project:
git clone https://github.com/banana-rdf/banana-rdf
cd banana-rdf
sbt ++2.11.8 publishLocal
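Once published locally, sbt can resolve the artifact from your local ivy repository. A minimal build.sbt sketch of what that dependency might look like (the organization and version shown here are assumptions; check this repo's actual build.sbt):

```scala
// Sketch only: pick up the locally published banana-rdf artifact.
// Organization and version are assumptions, not verified against this repo.
resolvers += Resolver.defaultLocal
libraryDependencies += "org.w3" %% "banana-rdf" % "0.8.5-SNAPSHOT"
```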
These steps walk through setting up a development environment, primarily for the ReactiveKafkaWriter module, but they should generalize to the other Scala modules in this codebase.
Clone this ld4p-data-pipeline git repository.
Use sbt to compile/assemble. sbt resolves dependencies, compiles to Java bytecode, and builds jar files. The assembly task builds the subproject jar(s); run from the top level without a subproject set, it compiles everything and runs assembly for every project, so you probably want to switch into the individual subproject you are working on first. Assembly artifacts can be written anywhere, and projects can be organized hierarchically in the project listing, if desired. A sketch of this layout follows.
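For orientation, the multi-project layout sbt is navigating looks roughly like this build.sbt sketch (the project names come from this README; the settings are illustrative assumptions, not the repo's actual build definition):

```scala
// Illustrative multi-project layout (an assumption, not the repo's build.sbt).
lazy val reactiveKafkaWriter = (project in file("ReactiveKafkaWriter"))
lazy val estimatorStreamingApp = (project in file("EstimatorStreamingApp"))

// The root project aggregates the subprojects, which is why a top-level
// assembly runs for every project unless you switch into one first.
lazy val root = (project in file("."))
  .aggregate(reactiveKafkaWriter, estimatorStreamingApp)
```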
Moving around in sbt is not like moving around a filesystem. Example usage:

projects # lists all projects available in the environment
project # shows the current project
project ReactiveKafkaWriter # sets the sbt console's current project to ReactiveKafkaWriter (or whatever)
compile # compiles the current project (ReactiveKafkaWriter)
assembly # builds the über jar for your current project
tasks # shows tasks available; same conventions as with Maven
Install Kafka (e.g. brew install kafka) and update .profile as needed (export KAFKA_HOME, update PATH). Then either start the services via Homebrew:

brew services start zookeeper
brew services start kafka

or start them manually:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Right now there is no Kafka GUI, but you can still use the CLI tools, located in /usr/bin (or wherever you installed Kafka).
To check if ZooKeeper is running and OK, use its four-letter-word health check: echo ruok | nc localhost 2181 (a healthy ZooKeeper replies imok).
Manual installation: update .profile (export SPARK_HOME, update PATH). Start the master with sbin/start-master.sh (depending on where you installed Spark locally), or use sbin/start-all.sh to start both the master and the "agent" (slave) systems.

About Spark cluster management / the cluster environment: start-slave doesn't start unless you point it at a master, but start-all doesn't require that.

Deploy the app to Spark, e.g.:
spark-submit --class EstimatorStreamingApp --name EstimatorStreamingApp --master spark://SPARK-MASTER-URL.local:7077 --deploy-mode cluster --executor-memory 14G --num-executors 2
You can retrieve the SPARK-MASTER-URL
(and Slave Node URLs) from the GUI.
Note on the local / local-cluster / cluster options: --deploy-mode has nothing to do with how you run the application; it only controls how the application is launched.
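Because --master and --deploy-mode are supplied by spark-submit, the application itself typically leaves them unset. A minimal sketch of what such an entry point might look like (illustrative only, not the actual EstimatorStreamingApp source):

```scala
// Minimal sketch: master and deploy mode come from spark-submit, not the code.
import org.apache.spark.sql.SparkSession

object EstimatorStreamingAppSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EstimatorStreamingApp") // matches --name above
      .getOrCreate()                    // picks up --master from spark-submit
    // ... streaming logic would go here ...
    spark.stop()
  }
}
```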
Then start the application via your generated jar, like:

java -jar ld4p-data-pipeline/EstimatorStreamingApp/target/scala-2.11/EstimatorStreamingApp-assembly-1.0.0-SNAPSHOT.jar
Depending on where you installed, you might need to configure logging or use sudo
to allow logs to be written.
Kafka has an option to auto-create a topic when that topic is first written to. However, auto-creation has raced with Spark's own attempt to create the topic and produced errors, so it is recommended to keep that option (auto.create.topics.enable in server.properties) disabled.
These steps start Kafka, create a topic named marc21, and display info on the created topic, respectively:

kafka-server-start.sh -daemon ~/Dev/kafka/config/server.properties
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 16 --topic marc21
kafka-topics.sh --describe --zookeeper localhost:2181 --topic marc21
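If you would rather create the topic from code than from the CLI (while still keeping auto-creation disabled), Kafka 0.11's AdminClient can do it. A hedged sketch, with the bootstrap server assumed to be a local single-broker setup:

```scala
// Sketch: explicit topic creation via Kafka 0.11's AdminClient.
// The bootstrap.servers value is an assumption for a local broker.
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
val admin = AdminClient.create(props)
// 16 partitions, replication factor 1, mirroring the CLI command above.
admin.createTopics(List(new NewTopic("marc21", 16, 1.toShort)).asJava).all().get()
admin.close()
```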
FYI: deleting topics doesn't actually delete the topic; it only marks it as "deleted". This behavior can be changed near the top of Kafka's server.properties file (delete.topic.enable), if needed.
You need the data files and a given directory structure. First, be on the Stanford VPN. Then:

mkdir -p ~/Dev/data
cd ~/Dev/data
scp ld4p@sul-ld4p-converter-dev.stanford.edu:ld4pData.1.zip ./
unzip ld4pData.1.zip
Note: This data is needed identically on each machine for clusters to work from filesystem data.
dataDir

dataDir, the location where filesystem data is read, is set either through application.conf file(s) or an ENV variable. Default value: ${HOME}/Dev/data/ld4pData. If you use a different location, adjust your conf files or ENV accordingly. Currently, some of the original structure is expected underneath that directory, as provided by the zip file.
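As a sketch of how that resolution might look in code with Typesafe Config (the config key and ENV variable names here are assumptions; check each module's application.conf):

```scala
// Sketch: resolve dataDir from an ENV variable, falling back to application.conf.
// "DATA_DIR" and the "dataDir" key are assumed names, not verified against the repo.
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.load() // reads application.conf from the classpath
val dataDir: String = sys.env.getOrElse("DATA_DIR", config.getString("dataDir"))
// Per the note above, the default resolves to ${HOME}/Dev/data/ld4pData.
```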
The class is configured to read from the Casalini MARC data as downloaded & configured above (36k records).
java -jar ReactiveKafkaWriter/target/scala-2.11/ReactiveKafkaWriter-assembly-1.0.0-SNAPSHOT.jar
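For context, a ReactiveKafkaWriter-style pipeline built on akka-stream-kafka (reactive-kafka) looks roughly like the following; this is purely illustrative and not the project's actual source:

```scala
// Illustrative reactive-kafka producer: stream records to the marc21 topic.
// The bootstrap server and dummy records are assumptions for a local setup.
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object ReactiveKafkaWriterSketch extends App {
  implicit val system: ActorSystem = ActorSystem("ReactiveKafkaWriterSketch")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val settings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")

  // Each MARC record would become one message on the marc21 topic.
  Source(List("record-1", "record-2"))
    .map(rec => new ProducerRecord[String, String]("marc21", rec))
    .runWith(Producer.plainSink(settings))
}
```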
To stop properly (e.g. when restarting the Spark streaming app): if using Homebrew, you can run brew services stop kafka, then brew services stop zookeeper.
Provision the AWS systems.
Once the AWS systems are available, set up ~/.ssh/config and /etc/hosts, e.g.

/etc/hosts:
{aws_public_ip} ld4p_dev_spark_master
{aws_public_ip} ld4p_dev_spark_worker1
...plus any additional worker nodes

~/.ssh/config:
Host ld4p_dev_spark_master
  User {aws_user}
  Hostname {use /etc/hosts name}
  IdentityFile ~/.ssh/{key-pair}.pem
  Port 22

Host ld4p_dev_spark_worker1
  User {aws_user}
  Hostname {use /etc/hosts name}
  IdentityFile ~/.ssh/{key-pair}.pem
  Port 22

...plus any additional worker nodes
Then the usual Capistrano workflow can be used, i.e.

bundle install
bundle exec cap -T
bundle exec cap ld4p_dev deploy:check
bundle exec cap ld4p_dev deploy
bundle exec cap ld4p_dev shell

Once the project is deployed to all the servers, run the assembly task for a project, e.g.

bundle exec cap ld4p_dev spark:assembly
bundle exec cap ld4p_dev stardog:assembly