Name: fusion-spark-bootcamp
Owner: Lucidworks
Description: Fusion Spark Bootcamp
Created: 2016-08-09 18:56:20.0
Updated: 2018-04-23 20:05:56.0
Pushed: 2018-04-23 20:05:54.0
Homepage: null
Size: 41288
Language: Shell
This project contains examples and labs for learning how to use Fusion's Spark features.
Download and install the latest version of Fusion 4.0.x from lucidworks.com. Take note of the location where you installed Fusion, such as /opt/lucidworks/fusion/4.0.0. We'll refer to this location as $FUSION_HOME hereafter.
Start Fusion by running:
cd $FUSION_HOME
bin/fusion start
Log in to the Fusion Admin UI in your browser at: http://localhost:8764
If this is the first time you're running Fusion, then you will be prompted to set a password for the default “admin” user.
Edit the myenv.sh script in this project to set the environment-specific variables used by the lab setup scripts.
The default mode of Fusion 4.0.x is to run Spark in local mode, but for these labs, we recommend starting the Fusion Spark Master and Worker processes.
cd $FUSION_HOME
bin/spark-master start
bin/spark-worker start
Open the Spark Master UI at http://localhost:8767 and verify the cluster is alive and has an active worker process.
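The same check can be scripted. The sketch below is an assumption-laden illustration, not part of the lab scripts: it assumes the Fusion Spark Master serves the standard Spark standalone JSON status page at /json on the UI port, and that the response carries top-level "status" and "aliveworkers" fields. The helper name spark_master_ok is hypothetical.

```shell
# Hypothetical helper: reads a Spark master JSON status document on stdin and
# succeeds only if it reports status ALIVE and at least one alive worker.
# The field names ("status", "aliveworkers") are assumed from the Spark
# standalone master's /json page; they are not documented in this README.
spark_master_ok() {
  status_json=$(cat)
  printf '%s\n' "$status_json" |
    grep -Eq '"status"[[:space:]]*:[[:space:]]*"ALIVE"' || return 1
  printf '%s\n' "$status_json" |
    grep -Eq '"aliveworkers"[[:space:]]*:[[:space:]]*[1-9]'
}

# Example (requires a running Spark master; the /json path is an assumption):
#   curl -s http://localhost:8767/json | spark_master_ok && echo "Spark cluster is up"
```

The helper only parses text, so it can be tried against a saved copy of the status page before pointing it at a live master.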
Lastly, launch the Fusion Spark shell to verify that all Fusion processes are configured and running correctly.
cd $FUSION_HOME
bin/spark-shell
This lab requires Fusion 4.0.0 or later.
Run the labs/apachelogs/setup_apachelogs.sh script to create the Fusion objects needed to support this lab.
The setup script indexes some sample log entries using the Fusion log indexer; see: Fusion Log Indexer
Once the log entries are indexed, the setup script registers a Scala script (labs/apachelogs/sessionize.scala) as a custom script job in Fusion and then runs the job. Note that the Scala script has to be converted into JSON (see job.json) when submitting it to Fusion.
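As a rough illustration of what converting a script into JSON involves, the sketch below escapes a script so it can sit inside a JSON string value. It is not the method the lab uses to produce job.json. The function names and the "id" and "script" field names are hypothetical; consult job.json for the real job definition. Only backslashes, double quotes, tabs, and newlines are handled; other control characters are left as-is.

```shell
# Escape a script read from stdin so it can be embedded in a JSON string:
# backslashes first, then double quotes and tabs; newlines become a literal \n.
json_escape() {
  tab=$(printf '\t')
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' -e "s/${tab}/\\\\t/g" |
    awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Hypothetical wrapper: the "id" and "script" field names are assumptions made
# for illustration only.
build_job_json() {
  # $1 = job id, $2 = path to the Scala script
  printf '{"id": "%s", "script": "%s"}\n' "$1" "$(json_escape < "$2")"
}

# Example:
#   build_job_json sessionize labs/apachelogs/sessionize.scala
```

Escaping backslashes before quotes matters: doing it in the other order would double the backslashes that the quote escaping just added.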
When the job finishes, check the aggregated results in the apachelogs_signals_aggr collection. You may need to send a hard commit to make sure all records are committed to the index:
curl "http://localhost:8983/solr/apachelogs_signals_aggr/update?commit=true"
This lab demonstrates how to read time-partitioned data using Spark.
Run the labs/eventsim/setup_eventsim.sh script to create the Fusion objects needed to support this lab.
The setup script launches the Fusion spark-shell to index 180,981 sample documents (generated from the eventsim project).
This lab demonstrates how to deploy a Spark ML pipeline-based classifier to predict a classification during indexing; it requires Fusion 4.0.0 or later.
NOTE: If you're running a multi-node Fusion cluster, then you need to run the setup script for this lab on a node that is running the connectors-classic service.
Run the labs/ml20news/setup_ml20news.sh script to create the Fusion objects needed to run this lab.
To see how the model for this lab was trained, see: MLPipelineScala.scala
This lab demonstrates how to deploy a Spark MLlib-based classifier to predict sentiment during indexing; it requires Fusion 4.0.0 or later.
Run the labs/mlsvm/setup_mlsvm.sh script to create the Fusion objects needed to run this lab.
To see how the model for this lab was trained, see: SVMExample.scala
This lab requires Fusion 4.0 or later.
Run the labs/movielens/setup_movielens.sh script to create collections in Fusion and populate them using Spark.
The setup script downloads the ml-100k data set from http://files.grouplens.org/datasets/movielens/ml-100k.zip and extracts it to labs/movielens/ml-100k.
You'll need unzip installed prior to running the script.
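A quick pre-flight check can catch a missing unzip before the setup script gets partway through. The sketch below is a hedged illustration, not the setup script's actual code: require_cmds and fetch_ml100k are hypothetical names, and using curl for the download is an assumption (the setup script may use a different tool).

```shell
# Succeeds if every named command is on the PATH; reports any that are missing.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "Missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Download and extract the data set (defined only; not invoked here).
# Using curl for the download is an assumption made for this sketch.
fetch_ml100k() {
  # $1 = directory that should end up containing ml-100k/
  require_cmds curl unzip || return 1
  curl -fsSL -o "$1/ml-100k.zip" \
    http://files.grouplens.org/datasets/movielens/ml-100k.zip &&
    unzip -q -o "$1/ml-100k.zip" -d "$1"
}

# Example:
#   require_cmds unzip && labs/movielens/setup_movielens.sh
```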
The setup script also invokes the Spark shell to load data into Solr; see the load_solr.scala script for more details.
Behind the scenes, the setup script launches the Spark shell in Fusion by running:
$FUSION_HOME/bin/spark-shell -i load_solr.scala
After loading the data, the setup script will (re)start the Fusion SQL engine using:
$FUSION_HOME/bin/sql restart
Test the Catalog API endpoint by executing the explore_movielens.sh script.
NOTE: It may take a few seconds the first time you run a query for Spark to distribute the Fusion shaded JAR to worker processes.
You can tune the resource allocation for the SQL engine so that it has a little more memory and CPU resources. Specifically, we'll give it 6 CPU cores and 2g of memory; feel free to adjust these settings for your workstation.
curl -u admin:password123 -H 'Content-type:application/json' -X PUT -d '6' "http://localhost:8764/api/apollo/configurations/fusion.sql.cores"
curl -u admin:password123 -H 'Content-type:application/json' -X PUT -d '6' "http://localhost:8764/api/apollo/configurations/fusion.sql.executor.cores"
curl -u admin:password123 -H 'Content-type:application/json' -X PUT -d '2g' "http://localhost:8764/api/apollo/configurations/fusion.sql.memory"
curl -u admin:password123 -H 'Content-type:application/json' -X PUT -d '6' "http://localhost:8764/api/apollo/configurations/fusion.sql.default.shuffle.partitions"
You'll need to restart the SQL engine after making these changes.
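The four PUT requests above differ only in the setting name and value, so they can be wrapped in a small helper. This is a minimal sketch under stated assumptions: the FUSION_API, FUSION_AUTH, and CURL variables and both function names are hypothetical conveniences, while the endpoint, credentials, and values are copied from the commands above.

```shell
# Hypothetical overridable defaults; the values are copied from the commands above.
FUSION_API="${FUSION_API:-http://localhost:8764/api/apollo}"
FUSION_AUTH="${FUSION_AUTH:-admin:password123}"
CURL="${CURL:-curl}"   # can be swapped for a stub to preview the requests

# PUT one fusion.sql.* configuration value.
put_sql_config() {
  # $1 = setting name after "fusion.sql.", $2 = value
  "$CURL" -u "$FUSION_AUTH" -H 'Content-type:application/json' -X PUT \
    -d "$2" "$FUSION_API/configurations/fusion.sql.$1"
}

# Apply the settings from this section; stops at the first failed request.
tune_sql_engine() {
  put_sql_config cores 6 &&
    put_sql_config executor.cores 6 &&
    put_sql_config memory 2g &&
    put_sql_config default.shuffle.partitions 6
}

# Example (requires a running Fusion):
#   tune_sql_engine && "$FUSION_HOME/bin/sql" restart
```

Chaining the restart after tune_sql_engine keeps the SQL engine from being restarted when one of the configuration updates fails.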