adobe/ml-featurizer

Name: ml-featurizer

Owner: Adobe Systems Incorporated

Description: null

Created: 2018-05-02 21:16:10.0

Updated: 2018-05-19 10:03:04.0

Pushed: 2018-05-05 06:27:57.0

Homepage: null

Size: 21

Language: Scala

README

ML Featurizer

Feature engineering is a difficult and time-consuming process. ML Featurizer is a library that lets users create additional features from raw data with ease. It extends and enriches Spark's existing feature engineering functionality.

Featurizers provided by the library
  1. Unary Temporal Featurizers
    • DayOfWeekFeaturizer
    • HourOfDayFeaturizer
    • MonthOfYearFeaturizer
    • PartsOfDayFeaturizer
    • WeekendFeaturizer
  2. Unary Numeric Featurizers
    • LogTransformFeaturizer
    • MathFeaturizer
    • PowerTransformFeaturizer
  3. Binary Temporal Featurizers
    • DateDiffFeaturizer
  4. Binary Numeric Featurizers
    • AdditionFeaturizer
    • DivisionFeaturizer
    • MultiplicationFeaturizer
    • SubtractionFeaturizer
  5. Binary String Featurizers
    • ConcateColumnsFeaturizer
  6. Grouping Featurizers
    • GroupByFeaturizer (min, max, count, avg, sum)
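To make the categories above concrete, here is a minimal plain-Scala sketch of the kind of value a unary numeric, a binary temporal, and a grouping featurizer might produce. It does not use ML Featurizer or Spark. The object name `FeatureSketch`, the function names, and the exact semantics (natural log, whole-day differences, per-key average) are assumptions made for illustration, since the README does not document them.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical helpers; not part of ML Featurizer.
object FeatureSketch {
  // Unary numeric: natural log of a positive value (assumed semantics).
  def logTransform(x: Double): Double = math.log(x)

  // Binary temporal: whole days from start to end, both in ISO yyyy-MM-dd form.
  def dateDiffDays(start: String, end: String): Long =
    ChronoUnit.DAYS.between(LocalDate.parse(start), LocalDate.parse(end))

  // Grouping: average of the value column for each key.
  def groupAvg(rows: Seq[(String, Double)]): Map[String, Double] =
    rows.groupBy(_._1).map { case (key, group) =>
      key -> group.map(_._2).sum / group.size
    }

  def main(args: Array[String]): Unit = {
    println(logTransform(1.0))                        // 0.0
    println(dateDiffDays("2018-01-02", "2018-02-02")) // 31
    println(groupAvg(Seq(("bmw", 1.0), ("bmw", 5.0), ("tesla", 8.0))))
  }
}
```

In the library these computations would be applied to DataFrame columns rather than to single values; the sketch only shows the arithmetic behind each category.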
Examples:

Create day of week feature
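import org.apache.spark.sql.SparkSession
// DayOfWeekFeaturizer is provided by ml-featurizer; also import it from the library's package.

object DayOfWeekFeaturizerExample {
  def main(args: Array[String]): Unit = {
    // Local Spark session for the example.
    val spark = SparkSession.builder().appName("DayOfWeekFeaturizer").master("local").getOrCreate()

    // Sample rows: (id, date string in yyyy-MM-dd format).
    val data = Array((0, "2018-01-02"),
      (1, "2018-02-02"),
      (2, "2018-03-02"),
      (3, "2018-04-05"),
      (3, "2018-05-05"))
    val dataFrame = spark.createDataFrame(data).toDF("id", "date")

    // Derive a "dayOfWeek" column from the "date" column.
    val featurizer = new DayOfWeekFeaturizer()
      .setInputCol("date")
      .setOutputCol("dayOfWeek")
      .setFormat("yyyy-MM-dd")

    val featurizedDataFrame = featurizer.transform(dataFrame)
    featurizedDataFrame.show()

    spark.stop()
  }
}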
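For readers who want to sanity-check the temporal features without starting Spark, the following plain-Scala sketch computes day of week, month of year, and a weekend flag for the same sample dates using `java.time`. The object `TemporalSketch` and its functions are hypothetical, and the ISO numbering (1 = Monday through 7 = Sunday) is an assumption; the README does not say how `DayOfWeekFeaturizer` encodes its output.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter

// Hypothetical helpers; not part of ML Featurizer.
object TemporalSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  // ISO day number: 1 = Monday ... 7 = Sunday (assumed encoding).
  def dayOfWeek(date: String): Int =
    LocalDate.parse(date, fmt).getDayOfWeek.getValue

  // Month number: 1 = January ... 12 = December.
  def monthOfYear(date: String): Int =
    LocalDate.parse(date, fmt).getMonthValue

  // True when the date falls on a Saturday or Sunday.
  def isWeekend(date: String): Boolean = {
    val day = LocalDate.parse(date, fmt).getDayOfWeek
    day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY
  }

  def main(args: Array[String]): Unit = {
    val dates = Seq("2018-01-02", "2018-02-02", "2018-03-02", "2018-04-05", "2018-05-05")
    dates.foreach { d =>
      println(s"$d dayOfWeek=${dayOfWeek(d)} monthOfYear=${monthOfYear(d)} isWeekend=${isWeekend(d)}")
    }
  }
}
```

Under this encoding, 2018-01-02 is a Tuesday (2) and 2018-05-05 is a Saturday (6), the only weekend date in the sample.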


Use featurizers in Spark ML Pipeline
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession
// The featurizers below are provided by ml-featurizer; also import them from the library's package.

object FeaturePipeline {
  def main(args: Array[String]): Unit = {
    // Local Spark session for the example.
    val spark = SparkSession.builder().appName("FeaturePipeline").master("local").getOrCreate()

    // Sample rows: (id, date, price1, price2, brand).
    val data = Array((0, "2018-01-02", 1.0, 2.0, "mercedes"),
      (1, "2018-02-02", 2.5, 3.5, "lexus"),
      (2, "2018-03-02", 5.0, 1.0, "toyota"),
      (3, "2018-04-05", 8.0, 9.0, "tesla"),
      (4, "2018-05-05", 1.0, 5.0, "bmw"),
      (4, "2018-05-05", 1.0, 5.0, "bmw"))
    val dataFrame = spark.createDataFrame(data).toDF("id", "date", "price1", "price2", "brand")

    // Temporal features derived from the "date" column.
    val dayOfWeekFeaturizer = new DayOfWeekFeaturizer()
      .setInputCol("date")
      .setOutputCol("dayOfWeek")
      .setFormat("yyyy-MM-dd")

    val monthOfYearFeaturizer = new MonthOfYearFeaturizer()
      .setInputCol("date")
      .setOutputCol("monthOfYear")
      .setFormat("yyyy-MM-dd")

    val weekendFeaturizer = new WeekendFeaturizer()
      .setInputCol("date")
      .setOutputCol("isWeekend")
      .setFormat("yyyy-MM-dd")

    // Numeric feature: price1 + price2.
    val additionFeaturizer = new AdditionFeaturizer()
      .setInputCols("price1", "price2")
      .setOutputCol("price1_add_price2")

    // Standard Spark ML stages: index the brand string, then one-hot encode the index.
    val indexer = new StringIndexer()
      .setInputCol("brand")
      .setOutputCol("brandIndex")

    val encoder = new OneHotEncoder()
      .setInputCol("brandIndex")
      .setOutputCol("brandVector")

    // Stages run in the order listed.
    val pipeline = new Pipeline()
      .setStages(Array(dayOfWeekFeaturizer, monthOfYearFeaturizer, weekendFeaturizer, additionFeaturizer,
        indexer, encoder))
    val model = pipeline.fit(dataFrame)
    model.transform(dataFrame).show()

    spark.stop()
  }
}
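The pipeline above applies its stages in order, each one adding a column to the data. A rough plain-Scala analogy, assuming each stage is just a function from a row to a row with one more field, is sketched below. `PipelineSketch`, `Row`, `Stage`, and both stage constructors are hypothetical names introduced here. The sketch also ignores Spark's distinction between estimators, which are fitted first, and transformers.

```scala
import java.time.LocalDate

// Hypothetical model of a pipeline; not Spark's or ML Featurizer's API.
object PipelineSketch {
  type Row = Map[String, Any]
  type Stage = Row => Row

  // Stage that adds the sum of two Double columns under the name `out`.
  def addColumns(a: String, b: String, out: String): Stage =
    row => row + (out -> (row(a).asInstanceOf[Double] + row(b).asInstanceOf[Double]))

  // Stage that adds the month number parsed from an ISO yyyy-MM-dd column.
  def monthOfYear(in: String, out: String): Stage =
    row => row + (out -> LocalDate.parse(row(in).asInstanceOf[String]).getMonthValue)

  // Apply the stages left to right, feeding each result into the next stage.
  def run(stages: Seq[Stage])(row: Row): Row = Function.chain(stages)(row)

  def main(args: Array[String]): Unit = {
    val stages = Seq(
      monthOfYear("date", "monthOfYear"),
      addColumns("price1", "price2", "price1_add_price2"))
    val row: Row = Map("id" -> 0, "date" -> "2018-01-02", "price1" -> 1.0, "price2" -> 2.0)
    val out = run(stages)(row)
    println(out("monthOfYear"))       // 1
    println(out("price1_add_price2")) // 3.0
  }
}
```

Modelling stages as functions makes the ordering explicit: a later stage could read a column that an earlier stage added, which is also why stage order matters in a Spark `Pipeline`.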


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.