sul-dlss/ld4p-data-pipeline

Name: ld4p-data-pipeline

Owner: Stanford University Digital Library

Owner: sul-dlss-labs

Description: Scala/Kafka/Spark Linked Data Pipeline

Created: 2017-08-01 22:59:01.0

Updated: 2017-10-01 01:58:03.0

Pushed: 2017-11-17 22:03:42.0

Homepage:

Size: 211

Language: XSLT

GitHub Committers

User | Most Recent Commit | # Commits

Other Committers

User | Email | Most Recent Commit | # Commits

README

Build Status

ld4p-data-pipeline

Project Management : https://ld-stanford.myjetbrains.com/youtrack/

Check the Wiki Pages for a comprehensive introduction to the project

Requirements

(This is in progress)

Unfortunately, the banana-rdf artifacts for the version of Scala we use are not available online, so banana-rdf needs to be cloned, compiled, and published locally before compiling this project.

git clone https://github.com/banana-rdf/banana-rdf
cd banana-rdf
sbt ++2.11.8 publishLocal
Installation

These steps walk through setting up a development environment for the ReactiveKafkaWriter module primarily, but should be generalizable to other Scala modules in this codebase.

1. Download the Project

Clone this ld4p-data-pipeline git repository.

2. Use sbt to compile/assemble

sbt resolves dependencies, compiles to Java bytecode, and builds jar files:

assembly builds the subproject jar(s). Run from the top level, with no subproject selected, it compiles everything and assembles every project, so you will usually want to switch into the individual subproject you are working on first. Assembly artifacts can be written anywhere, and the build can take advantage of the hierarchy in the project listing, if desired.

Moving around in sbt is not like navigating a filesystem. Example usage from the sbt console:

projects # lists all projects available in the environment
project  # shows the current project
project ReactiveKafkaWriter # sets the sbt console's current project to ReactiveKafkaWriter (or whichever)
compile  # compiles the current project (ReactiveKafkaWriter)
assembly # builds the über jar for your current project
tasks    # shows the tasks available; same conventions as with Maven
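
The same work can also be done non-interactively from the shell, for example:

sbt "project ReactiveKafkaWriter" assembly   # build just the ReactiveKafkaWriter über jar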
3. Have Kafka & Spark Running Locally
Kafka

Right now there is no Kafka GUI, but you can still use the CLI tools, located in the bin directory of wherever you installed Kafka (e.g. /usr/bin for a package install).

To check if ZooKeeper is running and OK:
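
One common check, assuming ZooKeeper is listening on its default port 2181, is the ruok four-letter command; a healthy server replies with imok:

echo ruok | nc localhost 2181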

Spark

Manual installation:

Run the script sbin/start-master.sh from within your Spark installation (the exact path depends on where you installed Spark locally).
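
For a Spark 2.x standalone setup this typically means starting a master and then attaching a worker to it; a minimal local sketch (take the master URL from your own master's web UI, which defaults to http://localhost:8080):

cd /path/to/spark                              # wherever Spark is installed
./sbin/start-master.sh                         # starts the standalone master
./sbin/start-slave.sh spark://localhost:7077   # attaches a worker to the master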

For background on Spark cluster management and the cluster environment, see the Spark standalone-mode documentation.

4. Launch the Application

Deploy the App to Spark, e.g.:

spark-submit --class EstimatorStreamingApp --name EstimatorStreamingApp --master spark://SPARK-MASTER-URL.local:7077 --deploy-mode cluster --executor-memory 14G --num-executors 2

You can retrieve the SPARK-MASTER-URL (and the slave/worker node URLs) from the Spark master web UI.

Note on the local / local-cluster / cluster options: --deploy-mode has nothing to do with how the application runs; it only controls how (and where) the application's driver is launched.

Then start the application via your generated Jar, like:

java -jar ld4p-data-pipeline/EstimatorStreamingApp/target/scala-2.11/EstimatorStreamingApp-assembly-1.0.0-SNAPSHOT.jar

Depending on where you installed things, you might need to configure logging or use sudo so that logs can be written.

5. Create a Kafka Topic for the MARC

Kafka can be configured to auto-create a topic the first time it is written to. However, auto-creation has caused errors when Kafka created a topic that Spark was also trying to create, so it is recommended to keep that option disabled.
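
If you want auto-creation off explicitly, the relevant broker property is auto.create.topics.enable in Kafka's server.properties:

auto.create.topics.enable=false   # do not create topics implicitly when clients reference them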

These steps start Kafka, create a topic named marc21, and display info about the created topic, respectively:

kafka-server-start.sh -daemon ~/Dev/kafka/config/server.properties
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 16 --topic marc21
kafka-topics.sh --describe --zookeeper localhost:2181 --topic marc21

FYI: deleting a topic doesn't actually delete it; it only marks the topic as "deleted". This behavior can be changed near the top of Kafka's server.properties file (the delete.topic.enable setting), if needed.
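
If you do need topics to be fully removed, that setting looks like:

delete.topic.enable=true   # in server.properties; lets topics marked for deletion actually be deleted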

6. Grab the MARC21 data in structured directories

You need the data files in a particular directory structure. First, be on the Stanford VPN. Then:

mkdir -p ~/Dev/data
cd ~/Dev/data
scp ld4p@sul-ld4p-converter-dev.stanford.edu:ld4pData.1.zip ./
unzip ld4pData.1.zip

Note: This data is needed identically on each machine for clusters to work from filesystem data.

dataDir

dataDir, the location from which filesystem data is read, is set either through application.conf file(s) or an environment variable. The default value is ${HOME}/Dev/data/ld4pData. If you use a different location, adjust your conf files or environment variable accordingly.

Currently, some original structure is expected underneath that, as provided by the zip file.
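
As a sketch only (the dataDir key comes from the prose above, but the exact file and override mechanism should be verified against the module's own configuration), a Typesafe Config-style override in application.conf might look like:

dataDir = "/srv/ld4p/data/ld4pData"   # hypothetical non-default location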

7. ReactiveKafkaWriter is ready to be run

The class is configured to read from the Casalini MARC data as downloaded & configured above (36k records).
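
Assuming its assembly jar is produced in the same layout as the EstimatorStreamingApp jar shown above (the path below is an assumption, not documented), launching it could look like:

java -jar ld4p-data-pipeline/ReactiveKafkaWriter/target/scala-2.11/ReactiveKafkaWriter-assembly-1.0.0-SNAPSHOT.jar   # hypothetical path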

8. How to kill a Spark job:

To stop properly:

When restarting the Spark streaming app:
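
For a driver that was submitted with --deploy-mode cluster to a standalone master, one way to stop it is spark-submit --kill; the submission ID below is a placeholder taken from spark-submit's output or the master web UI:

spark-submit --master spark://SPARK-MASTER-URL.local:7077 --kill driver-20171117000000-0000   # placeholder driver ID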

9. Stop / Shut down your Kafka and ZooKeeper instances

If using Homebrew, you can run brew services stop kafka and then brew services stop zookeeper.
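
For a manual (non-Homebrew) install, the scripts that ship with Kafka do the same job:

kafka-server-stop.sh        # stop the Kafka broker first
zookeeper-server-stop.sh    # then stop ZooKeeper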

10. Deployment on Amazon via Capistrano

Provision the AWS systems, e.g. EC2 instances for the Spark master and worker nodes.

Once the AWS systems are available, set up ~/.ssh/config and /etc/hosts, e.g.:

/etc/hosts
{aws_public_ip}  ld4p_dev_spark_master
{aws_public_ip}  ld4p_dev_spark_worker1
plus any additional worker nodes

~/.ssh/config

Host ld4p_dev_spark_master
User {aws_user}
Hostname {use /etc/hosts name}
IdentityFile ~/.ssh/{key-pair}.pem
Port 22

Host ld4p_dev_spark_worker1
User {aws_user}
Hostname {use /etc/hosts name}
IdentityFile ~/.ssh/{key-pair}.pem
Port 22

plus any additional worker nodes

Then the usual Capistrano workflow can be used, i.e.:

bundle install
bundle exec cap -T
bundle exec cap ld4p_dev deploy:check
bundle exec cap ld4p_dev deploy
bundle exec cap ld4p_dev shell

Once the project is deployed to all the servers, run the assembly task for a project, e.g.

bundle exec cap ld4p_dev spark:assembly
bundle exec cap ld4p_dev stardog:assembly


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.