lucidworks/fusion-spark-xfer

Name: fusion-spark-xfer

Owner: Lucidworks

Description: Spark submit app for transferring data from one collection to another, potentially across clusters

Created: 2018-01-19 15:47:46.0

Updated: 2018-05-09 21:15:19.0

Pushed: 2018-05-09 21:15:18.0

Size: 14

Language: Scala

README

Fusion Collection Transfer App

Spark submit app for transferring data from one collection to another, potentially across clusters.

Getting Started

You'll need Fusion 3.1.3 or later.

In the fusion-xfer.sh script, configure the parameters used to submit the job to Fusion's Spark master, for example:

#!/bin/bash
FUSION_HOME=/opt/fusion/3.1.3
SPARK_MASTER=local[*]
APP_JAR=/opt/fusion-spark-xfer/target/fusion-spark-xfer-1.0-shaded.jar
$FUSION_HOME/apps/spark-dist/bin/spark-submit --master $SPARK_MASTER \
  --class com.lucidworks.spark.CollectionTransferApp $APP_JAR \
  --destinationSolrClusterZk localhost:9983/lwfusion/3.1.3/solr \
  --destinationCollection dest_signals \
  --sourceSolrClusterZk localhost:9983/lwfusion/3.1.3/solr \
  --sourceCollection source_signals

Configure job resource allocation using the standard spark-submit options. For the full list, run: $FUSION_HOME/apps/spark-dist/bin/spark-submit --help
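As a rough illustration of resource allocation, the sketch below adds three standard spark-submit options to the start of the submit command. The values (2g, 4g, 8 cores) and the RESOURCE_OPTS and submit_prefix names are assumptions chosen for the example, not settings recommended by this project; the function only prints the command prefix rather than running it.

```shell
# Hypothetical resource settings; tune these for your own cluster.
# --driver-memory, --executor-memory and --total-executor-cores are standard
# spark-submit options (the last one applies to standalone and Mesos masters).
RESOURCE_OPTS="--driver-memory 2g --executor-memory 4g --total-executor-cores 8"

FUSION_HOME="${FUSION_HOME:-/opt/fusion/3.1.3}"
SPARK_MASTER="${SPARK_MASTER:-local[*]}"

# Dry run: print the start of the command instead of executing it.
# $RESOURCE_OPTS is left unquoted on purpose so it splits into separate arguments.
submit_prefix() {
  echo "$FUSION_HOME/apps/spark-dist/bin/spark-submit" --master "$SPARK_MASTER" $RESOURCE_OPTS
}

submit_prefix
```

In fusion-xfer.sh, the same options would go between --master and --class, ahead of the application JAR.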

To get the active Spark master for Fusion, run: curl http://localhost:8765/api/v1/spark/master
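The lookup can also be scripted so that SPARK_MASTER is filled in automatically. This is a minimal sketch that assumes the endpoint replies with the master URL as plain text; if your Fusion version returns JSON or another format, the reply needs to be parsed first. The choose_master helper and the FUSION_API variable are hypothetical names introduced here, and the fallback to local[*] mirrors the value used in the sample script.

```shell
# Assumption: the endpoint replies with the master URL as plain text.
FUSION_API="${FUSION_API:-http://localhost:8765/api/v1}"

# Return the API reply if it is non-empty, otherwise fall back to local mode.
choose_master() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' 'local[*]'
  fi
}

# -s hides progress output, -f makes curl fail on HTTP errors, and
# --max-time bounds the wait. "|| true" keeps the script going when
# Fusion is unreachable, which leaves the reply empty.
reply="$(curl -sf --max-time 5 "$FUSION_API/spark/master" 2>/dev/null || true)"

SPARK_MASTER="$(choose_master "$reply")"
echo "Using Spark master: $SPARK_MASTER"
```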

CollectionTransferApp Options:

--batchSize <arg>                  Batch size for writing docs to the destination cluster; defaults to 10000

--destinationCollection <arg>      Name of the Solr collection on the destination cluster to write data to; uses source name if not provided

--destinationSolrClusterZk <arg>   ZooKeeper connection string for the Solr cluster this app transfers data to

--findNewOnly true|false           Flag to indicate if this app should look for new docs in the source using the latest timestamp in the
                                   destination; defaults to true, set to false to skip this check and pull all docs that match the source query

--sourceCollection <arg>           Name of the Solr collection on the source cluster to read data from

--sourceQuery <arg>                Query to source collection for docs to transfer; uses *:* if not provided

--sourceSolrClusterZk <arg>        ZooKeeper connection string for the Solr cluster this app transfers data from

--sparkConf <arg>                  Additional Spark configuration properties file

--timestampField <arg>             Timestamp field name on docs; defaults to 'timestamp_tdt'

--verbose                          Generate verbose log messages
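To show how several of these options combine, here is a sketch of a full transfer of the last seven days of documents. The query, batch size, and the run_xfer and DRY_RUN names are assumptions made for illustration; the paths, ZooKeeper string, and collection names are the sample values from the script above. By default the function only prints the command.

```shell
FUSION_HOME="${FUSION_HOME:-/opt/fusion/3.1.3}"
SPARK_MASTER="${SPARK_MASTER:-local[*]}"
APP_JAR="${APP_JAR:-/opt/fusion-spark-xfer/target/fusion-spark-xfer-1.0-shaded.jar}"
ZK="localhost:9983/lwfusion/3.1.3/solr"

run_xfer() {
  # Load the command into the function's positional parameters so each
  # argument, including the query that contains spaces, stays one word.
  set -- "$FUSION_HOME/apps/spark-dist/bin/spark-submit" --master "$SPARK_MASTER" \
    --class com.lucidworks.spark.CollectionTransferApp "$APP_JAR" \
    --sourceSolrClusterZk "$ZK" \
    --sourceCollection source_signals \
    --destinationSolrClusterZk "$ZK" \
    --destinationCollection dest_signals \
    --sourceQuery 'timestamp_tdt:[NOW-7DAYS TO *]' \
    --findNewOnly false \
    --batchSize 5000 \
    --verbose

  # DRY_RUN defaults to true, so the sketch prints the command only.
  if [ "${DRY_RUN:-true}" = "true" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run_xfer
```

Setting DRY_RUN=false would execute the command instead of printing it. Because --findNewOnly is false here, the timestamp check against the destination is skipped and every document matching the source query is pulled.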
