Intel-bigdata/imllib-spark

Name: imllib-spark

Owner: Intel-bigdata

Description: null

Created: 2017-02-06 08:38:32.0

Updated: 2018-05-15 07:59:48.0

Pushed: 2017-10-31 09:19:25.0

Homepage: null

Size: 1264

Language: Terra

README

IMLLIB

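IMLLIB is a package of Spark-based machine learning implementations. It includes Factorization Machines (LibFM), Field-aware Factorization Machines (FFM), Conditional Random Fields (CRF), and AdaOptimizer (Adam and AdaGrad).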

This package can be imported as a dependency in other projects, which makes all functions of LibFM, FFM, CRF and AdaOptimizer available.

Build from Source

    sbt package

How to Use

The IMLLIB package can either be loaded directly in Spark-Shell or imported as a dependency in other projects.

Use in Spark-Shell

Temporary Use

(1) Run Spark-Shell with IMLLIB package

    spark-shell --jars 'Path/imllib_2.11-0.0.1.jar'

Here, Path is the directory that contains imllib_2.11-0.0.1.jar.

Permanent Use

(1) On the driver node, add the following lines to conf/spark-defaults.conf

    spark.executor.extraClassPath    /usr/local/spark/lib/*
    spark.driver.extraClassPath      /usr/local/spark/lib/*

(2) Create /usr/local/spark/lib
(3) Copy imllib_2.11-0.0.1.jar to /usr/local/spark/lib
(4) Copy conf/spark-defaults.conf and /usr/local/spark/lib to all worker nodes
(5) Run Spark-Shell

    spark-shell

Use as a dependency

(1) Build from source and publish locally

    sbt compile publish-local

(2) Move the whole directory com.intel from .ivy2/local to .ivy2/cache
(3) Add the following line to build.sbt in the project that imports the IMLLIB package as a dependency

    libraryDependencies += "com.intel" % "imllib_2.11" % "0.0.1"

How to Import

    import com.intel.imllib._

Test Examples

The bin/template directory contains shell scripts for testing LibFM, FFM, CRF and LR with AdaOptimizer, respectively. Each script runs Spark in local mode with the data stored on Hadoop. First modify the script as needed (for example, the Hadoop hostname and port), then run it to check that the algorithm works.


FM-Spark

A Spark-based implementation of Factorization Machines (LibFM)
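
For orientation, the standard second-order Factorization Machine score can be sketched in plain Scala as below. This is a minimal illustration of the model itself, not the code used in this package: the object name `FmSketch`, the `predict` signature, and the dense array layout of the parameters are all assumptions made for the example.

```scala
object FmSketch {
  /** Second-order FM score for one sparse example (hypothetical helper).
    * w0: global bias; w: linear weights; v(i): k-dimensional latent vector of feature i;
    * x: sparse features given as (index, value) pairs.
    */
  def predict(w0: Double, w: Array[Double], v: Array[Array[Double]],
              x: Seq[(Int, Double)]): Double = {
    val k = if (v.isEmpty) 0 else v(0).length
    val linear = x.map { case (i, xi) => w(i) * xi }.sum
    // Pairwise term computed with the usual O(k * nnz) identity:
    // sum_{i<j} <v_i, v_j> x_i x_j
    //   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
    val pairwise = (0 until k).map { f =>
      val s = x.map { case (i, xi) => v(i)(f) * xi }.sum
      val sq = x.map { case (i, xi) => math.pow(v(i)(f) * xi, 2) }.sum
      0.5 * (s * s - sq)
    }.sum
    w0 + linear + pairwise
  }
}
```

The identity in the comment avoids the explicit double loop over feature pairs, which matters when examples are sparse and high-dimensional.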

The code structure of this project is taken from spark-libFM, but the optimization method is based on parallel SGD, which has stronger convergence than mini-batch SGD.
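
To make the contrast with mini-batch SGD concrete, here is a small sketch of one common reading of parallel SGD: each data shard runs ordinary SGD independently, and the resulting weight vectors are averaged. That reading is an assumption, as are the names `ParallelSgdSketch`, `localSgd` and `parallelSgd` and the use of squared loss; the sketch uses plain Scala collections in place of Spark partitions and does not reproduce this package's optimizer.

```scala
object ParallelSgdSketch {
  type Example = (Array[Double], Double) // (features, label)

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  /** Ordinary sequential SGD for squared loss on a single shard. */
  def localSgd(shard: Seq[Example], init: Array[Double],
               lr: Double, epochs: Int): Array[Double] = {
    val w = init.clone()
    for (_ <- 0 until epochs; (x, y) <- shard) {
      val err = dot(w, x) - y
      for (i <- w.indices) w(i) -= lr * err * x(i)
    }
    w
  }

  /** Assumption: "parallel SGD" is read as parameter averaging.
    * Every shard trains from the same starting point, then the models are averaged.
    */
  def parallelSgd(shards: Seq[Seq[Example]], dim: Int,
                  lr: Double, epochs: Int): Array[Double] = {
    val models = shards.map(s => localSgd(s, Array.fill(dim)(0.0), lr, epochs))
    Array.tabulate(dim)(i => models.map(_(i)).sum / models.size)
  }
}
```

On Spark, the per-shard step would typically map onto something like `RDD.mapPartitions`, with the averaging done on the driver or through an aggregation.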

FFM-Spark

A Spark-based implementation of Field-aware Factorization Machines (FFM) with a parallelized AdaGrad solver. See http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf

The input data must be reworked to fit FFM: the LIBSVM data format is extended by adding field information to each feature, giving a format like:

    label field1:feat1:val1 field2:feat2:val2
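
As an illustration of this format, the following sketch parses one such line in plain Scala. The names `FfmFormatSketch`, `FfmNode` and `parseLine` are hypothetical, and the sketch assumes that field and feature identifiers are integers; the package's own loader may differ.

```scala
object FfmFormatSketch {
  /** One entry of an example: field id, feature id, value. */
  final case class FfmNode(field: Int, feature: Int, value: Double)

  /** Parses "label field:feat:val field:feat:val ..." into (label, nodes).
    * Assumption: field and feature ids are integers.
    */
  def parseLine(line: String): (Double, Seq[FfmNode]) = {
    val tokens = line.trim.split("\\s+")
    require(tokens.nonEmpty && tokens(0).nonEmpty, "empty line")
    val nodes = tokens.drop(1).toSeq.map { tok =>
      tok.split(":") match {
        case Array(f, j, v) => FfmNode(f.toInt, j.toInt, v.toDouble)
        case _ => throw new IllegalArgumentException(s"bad token: $tok")
      }
    }
    (tokens(0).toDouble, nodes)
  }
}
```

For example, `FfmFormatSketch.parseLine("1 0:3:1.0 2:7:0.5")` yields the label `1.0` and two nodes, one for field 0 and one for field 2.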

CRF-Spark

A Spark-based implementation of Conditional Random Fields (CRF) for segmenting/labeling sequential data.

CRF-Spark provides following features:

AdaOptimizer

A Spark-based implementation of the Adam and AdaGrad optimizers, two methods for stochastic optimization. See https://arxiv.org/abs/1412.6980 (Adam) and http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf (AdaGrad). Compared with SGD, Adam and AdaGrad perform better; with sparse features in particular, Adam can converge faster than plain SGD.
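
The update rules from the two referenced papers can be sketched in plain Scala as follows. This is a single-machine illustration only: the object name `AdaSketch`, the function signatures, and the default hyperparameters (taken from the Adam paper's suggested values) are assumptions, not this package's API.

```scala
object AdaSketch {
  /** AdaGrad step: each coordinate's step size shrinks with its accumulated
    * squared gradients. Mutates w and g2 in place.
    */
  def adagradStep(w: Array[Double], grad: Array[Double], g2: Array[Double],
                  lr: Double, eps: Double = 1e-8): Unit =
    for (i <- w.indices) {
      g2(i) += grad(i) * grad(i)
      w(i) -= lr * grad(i) / (math.sqrt(g2(i)) + eps)
    }

  /** Adam step with bias-corrected moment estimates; t is the 1-based step count.
    * Mutates w, m and v in place.
    */
  def adamStep(w: Array[Double], grad: Array[Double],
               m: Array[Double], v: Array[Double], t: Int,
               lr: Double = 0.001, beta1: Double = 0.9, beta2: Double = 0.999,
               eps: Double = 1e-8): Unit =
    for (i <- w.indices) {
      m(i) = beta1 * m(i) + (1 - beta1) * grad(i)
      v(i) = beta2 * v(i) + (1 - beta2) * grad(i) * grad(i)
      val mHat = m(i) / (1 - math.pow(beta1, t))
      val vHat = v(i) / (1 - math.pow(beta2, t))
      w(i) -= lr * mHat / (math.sqrt(vHat) + eps)
    }
}
```

In the AdaGrad step, a coordinate whose gradient is zero keeps its accumulator and weight unchanged, so rarely active features retain a comparatively large effective step size; this per-coordinate scaling is the usual explanation for the behavior on sparse features.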

Contact & Feedback

If you encounter bugs, feel free to submit an issue or pull request. Also you can mail to:

