Name: imllib-spark
Owner: Intel-bigdata
Description: null
Created: 2017-02-06 08:38:32.0
Updated: 2018-05-15 07:59:48.0
Pushed: 2017-10-31 09:19:25.0
Homepage: null
Size: 1264
Language: Terra
This package contains three Spark-based implementations: LibFM, FFM, and CRF, together with an AdaOptimizer. It can be imported as a dependency in other code, making all functions of LibFM, FFM, CRF, and AdaOptimizer available.
The IMLLIB package can either be imported directly in spark-shell, or be imported as a dependency in other code.
(1) Run spark-shell with the IMLLIB package:
spark-shell --jars 'Path/imllib_2.11-0.0.1.jar'
where Path is the path of imllib_2.11-0.0.1.jar
(1) On the driver node, add the following lines to conf/spark-default.conf:
spark.executor.extraClassPath /usr/local/spark/lib/*
spark.driver.extraClassPath /usr/local/spark/lib/*
(2) Create /usr/local/spark/lib
(3) Copy imllib_2.11-0.0.1.jar to /usr/local/spark/lib
(4) Copy conf/spark-default.conf and /usr/local/spark/lib to all worker nodes
(5) Run spark-shell:
spark-shell
(1) Build from source and publish locally:
sbt compile publish-local
(2) Move the whole directory com.intel from .ivy2/local to .ivy2/cache
(3) Add the following line to build.sbt when you want to import the IMLLIB package as a dependency:
libraryDependencies += "com.intel" % "imllib_2.11" % "0.0.1"
Then, in your code:
import com.intel.imllib._
There are three shell scripts in bin/template for testing LibFM, FFM, CRF, and LR with AdaOptimizer respectively. The scripts run Spark in local mode with data on Hadoop.
First modify a script with the necessary changes, such as the Hadoop hostname and port; then run it to test whether the algorithm works.
A Spark-based implementation of Factorization Machines (LibFM)
The code base structure of this project comes from spark-libFM, but the optimization method is based on parallel SGD, which has stronger convergence than mini-batch SGD.
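The parallel-SGD idea mentioned above can be sketched without Spark: each data shard runs SGD locally over its own records, and the resulting per-shard weights are averaged (in Spark this would be a mapPartitions followed by a reduce). The object, model, and data below are purely illustrative, not imllib's API; the example fits a 1-D least-squares slope.

```scala
// Minimal parallel-SGD sketch (illustrative names, not imllib's API):
// run SGD independently on each shard, then average the weights.
object ParallelSgdSketch {
  // One local SGD pass over a shard for a 1-D model y ≈ w * x.
  def localSgd(shard: Seq[(Double, Double)], w0: Double, lr: Double): Double =
    shard.foldLeft(w0) { case (w, (x, y)) =>
      val grad = 2.0 * (w * x - y) * x // d/dw of (w*x - y)^2
      w - lr * grad
    }

  def main(args: Array[String]): Unit = {
    val data = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)) // y = 2x
    val shards = data.grouped(2).toSeq
    // The parallel-SGD step: average the independently trained weights.
    val w = shards.map(localSgd(_, 0.0, 0.05)).sum / shards.size
    println(f"averaged weight after one local pass per shard: $w%.2f")
  }
}
```

A single averaging round already moves toward the true slope; imllib's implementation iterates this shard-train/average cycle, which is why it converges more strongly than mini-batch SGD on the same cluster layout.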
A Spark-based implementation of Field-aware Factorization Machines (FFM) with a parallelized AdaGrad solver. See http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
You need to rework the data format to fit FFM, that is, extend the LIBSVM data format by adding field information to each feature, giving lines of the form:
label field1:feat1:val1 field2:feat2:val2
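Parsing that extended format can be sketched as below; FFMNode and FFMPoint are hypothetical names chosen for this example, not imllib's actual data types.

```scala
// Sketch of parsing the extended LIBSVM line "label field:feature:value ...".
// FFMNode/FFMPoint are illustrative types, not imllib's API.
object FFMFormat {
  case class FFMNode(field: Int, feature: Int, value: Double)
  case class FFMPoint(label: Double, nodes: Array[FFMNode])

  def parse(line: String): FFMPoint = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    // Each remaining token is a field:feature:value triple.
    val nodes = tokens.tail.map { t =>
      val Array(field, feat, value) = t.split(":")
      FFMNode(field.toInt, feat.toInt, value.toDouble)
    }
    FFMPoint(label, nodes)
  }

  def main(args: Array[String]): Unit = {
    // One training line with two active features in fields 0 and 1.
    val p = parse("1 0:3:1.0 1:7:0.5")
    println(s"label=${p.label}, nodes=${p.nodes.toSeq}")
  }
}
```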
A Spark-based implementation of Conditional Random Fields (CRF) for segmenting/labeling sequential data.
CRF-Spark provides the following features:
A Spark-based implementation of the Adam and AdaGrad optimizers, methods for stochastic optimization. See https://arxiv.org/abs/1412.6980 and http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf. Compared with SGD, Adam and AdaGrad perform better; especially in the case of sparse features, Adam can converge faster than plain SGD.
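The per-coordinate AdaGrad rule from the Duchi et al. paper can be sketched as follows. This is an illustration of the update itself, not imllib's AdaOptimizer API; class and parameter names are chosen for the example.

```scala
// Minimal AdaGrad sketch (illustrative, not imllib's AdaOptimizer API).
object AdaGradSketch {
  final class AdaGrad(dim: Int, lr: Double = 0.5, eps: Double = 1e-8) {
    // Per-coordinate sum of squared gradients.
    private val accum = Array.fill(dim)(0.0)

    def step(w: Array[Double], grad: Array[Double]): Unit = {
      var i = 0
      while (i < dim) {
        accum(i) += grad(i) * grad(i)
        // The effective learning rate shrinks per coordinate as gradients
        // accumulate, so rarely updated (sparse) features keep larger steps.
        w(i) -= lr * grad(i) / (math.sqrt(accum(i)) + eps)
        i += 1
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Minimize f(w) = w(0)^2 + w(1)^2; the gradient is 2 * w.
    val w = Array(3.0, -2.0)
    val opt = new AdaGrad(dim = 2)
    for (_ <- 1 to 200) opt.step(w, w.map(2 * _))
    println(s"w after 200 steps: ${w.toSeq}")
  }
}
```

Adam extends this idea with exponentially decayed first- and second-moment estimates instead of the raw accumulated squares, which is what allows the faster convergence on sparse features noted above.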
If you encounter bugs, feel free to submit an issue or pull request. You can also mail to: