Intel-bigdata/CRF-Spark

Name: CRF-Spark

Owner: Intel-bigdata

Description: A Spark-based implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data

Created: 2016-02-18 08:22:37

Updated: 2018-05-20 17:05:48

Pushed: 2017-07-05 12:32:51

Homepage:

Size: 249 KB

Language: Scala


README

CRF-Spark

A Spark-based implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. This project is deprecated and will no longer be maintained; please use the new project at https://github.com/Intel-bigdata/imllib-spark instead.

Requirements

This documentation is for Spark 1.4+. Other versions will probably work but have not been tested.
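
To try the examples below interactively, one option is to load the built jar into a Spark shell (a sketch; the jar path is a placeholder for wherever your build puts the artifact):

spark-shell --jars path/to/crf-spark.jar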

Features

CRF-Spark provides the following features:

Applications

A web-based application of this package to Chinese POS Tagging.

Example
Scala API
// Feature templates in CRF++ style: U00:%x[-1,0] refers to column 0 of the
// previous token, U01:%x[0,0] to the current token, U02:%x[1,0] to the next
// token; "B" enables bigram (transition) features.
val template = Array("U00:%x[-1,0]", "U01:%x[0,0]", "U02:%x[1,0]", "B")

// Serialized format expected by Sequence.deSerializer: tokens are separated
// by tabs, and each token is "label|--|feature1|-|feature2|-|...".
val train1 = Array("B-NP|--|Friday|-|NNP\tB-NP|--|'s|-|POS", "I-NP|--|Market|-|NNP\tI-NP|--|Activity|-|NN")
val test1 = Array("null|--|Market|-|NNP\tnull|--|Activity|-|NN")
val trainRdd1 = sc.parallelize(train1).map(Sequence.deSerializer)
val model1 = CRF.train(template, trainRdd1)
val result1 = model1.predict(test1.map(Sequence.deSerializer))
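
To eyeball the predictions from the snippet above (a minimal sketch; the exact output depends on Sequence's string rendering):

// print each predicted sequence
result1.foreach(println)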

// The same data in a plain space-separated format, where the last field of
// each token line is the label, built into Sequence/Token objects by hand.
val train2 = Array("Friday NNP B-NP\n's POS B-NP", "Market NNP I-NP\nActivity NN I-NP")
val test2 = Array("Market NNP\nActivity NN")
val trainRdd2 = sc.parallelize(train2).map(sentence => {
  val tokens = sentence.split("\n")
  Sequence(tokens.map(token => {
    val tags: Array[String] = token.split(' ')
    Token.put(tags.last, tags.dropRight(1))
  }))
})

val model2 = CRF.train(template, trainRdd2)
val result2 = model2.predict(test2.map(sentence => {
  val tokens = sentence.split("\n")
  Sequence(tokens.map(token => {
    val tags: Array[String] = token.split(' ')
    Token.put(tags)
  }))
}))
Building From Source

Build the jar with sbt:

sbt package

Building with breeze version 0.12

As our experiments showed, building CRF-Spark with breeze 0.12 improves performance dramatically. However, Spark still depends on breeze 0.11.2, even in the latest release, 2.0.0. Luckily, support for breeze 0.12 was added in a later commit, so if you want to try it, you can clone the upstream Spark repository and build a Spark 2.1.0-SNAPSHOT yourself. In addition, don't forget to change the breeze dependency from 0.11.2 to 0.12 in your build.sbt file.
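
For reference, a minimal build.sbt fragment pinning breeze to 0.12 could look like this (a sketch; org.scalanlp is breeze's group id, but adjust the Scala version and the rest of the build definition to your setup):

libraryDependencies += "org.scalanlp" %% "breeze" % "0.12"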

Note that there seems to be a difference (maybe a bug) in the convergence check in breeze 0.12 when CRF-Spark runs L-BFGS optimization. For example, in breeze 0.11.2, iterations stop when the one-off improvement reaches 1E-3 to 1E-4 if the user-set tolerance is 1E-3, but in breeze 0.12, iterations keep going down to 1E-6 with the same parameters. Thus, when you use breeze 0.12, please set the tolerance a little higher to avoid computing too many iterations, or simply cap maxIterations at whatever you need.
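
To make the tolerance and iteration knobs concrete, here is a small self-contained sketch that drives breeze's L-BFGS directly on a toy objective (this is plain breeze, not CRF-Spark's API; the named arguments match breeze's LBFGS constructor):

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// A looser tolerance plus an explicit iteration cap, per the advice above.
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7, tolerance = 1e-3)

// Toy convex objective f(x) = ||x - 3||^2 with gradient 2 * (x - 3).
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val diff = x - 3.0
    (diff dot diff, diff * 2.0)
  }
}

val xOpt = lbfgs.minimize(f, DenseVector.zeros[Double](5)) // converges near 3.0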

Contact & Feedback

If you encounter bugs, feel free to submit an issue or a pull request. You can also reach us by mail:

