CD2H gitForager

intel-analytics/StatisticsOnSpark

Name: StatisticsOnSpark

Owner: intel-analytics

Description: Assembly of fundamental statistics implemented based on Apache Spark

Created: 2015-12-29 06:17:04.0

Updated: 2017-12-19 14:40:40.0

Pushed: 2016-02-11 16:00:25.0

Homepage: null

Size: 19

Language: Scala


README

Spark.statistics

Assembly of fundamental statistics implemented based on Apache Spark

Requirements

This documentation is for Spark 1.3+. Other versions will probably work but have not been tested.

Features

Spark.statistics aims to provide fundamental statistical functions.

Currently we support:

One Sample T Test
Independent Samples T Test
Paired Samples T Test
One-way ANOVA
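
For orientation, the statistics behind these tests can be sketched in plain Scala on local collections. This is a minimal illustration of the textbook formulas, not the library's implementation: the object name `TStatSketch` and all of its method names are hypothetical, and it omits the Spark distribution and the p-value computation.

```scala
// Hypothetical helper object; none of these names come from the library.
object TStatSketch {
  // Sample mean of a non-empty sequence.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Unbiased sample variance (divides by n - 1); needs at least two values.
  def sampleVariance(xs: Seq[Double]): Double = {
    require(xs.size >= 2, "need at least two observations")
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }

  // One-sample t statistic: (mean - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom.
  def oneSampleT(xs: Seq[Double], mu0: Double): Double =
    (mean(xs) - mu0) / math.sqrt(sampleVariance(xs) / xs.size)

  // Paired-samples t statistic: a one-sample test of the pairwise differences against 0.
  def pairedT(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "paired samples must have equal length")
    oneSampleT(xs.zip(ys).map { case (x, y) => x - y }, 0.0)
  }

  // One-way ANOVA F statistic: between-group mean square over within-group mean square.
  def oneWayAnovaF(groups: Seq[Seq[Double]]): Double = {
    val k = groups.size                 // number of groups
    val n = groups.map(_.size).sum      // total number of observations
    val grand = groups.flatten.sum / n  // grand mean
    val ssBetween = groups.map(g => g.size * math.pow(mean(g) - grand, 2)).sum
    val ssWithin = groups.map { g =>
      val m = mean(g)
      g.map(x => (x - m) * (x - m)).sum
    }.sum
    (ssBetween / (k - 1)) / (ssWithin / (n - k))
  }
}
```

For example, `TStatSketch.oneWayAnovaF(Seq(Seq(1.0, 2.0, 3.0), Seq(2.0, 3.0, 4.0), Seq(3.0, 4.0, 5.0)))` evaluates to 3.0: both sums of squares are 6, with 2 and 6 degrees of freedom respectively.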

More features are planned. Next on the list:

Post hoc comparison
Log-likelihood
Kolmogorov-Smirnov test

Example

Scala API

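// Two small samples to compare
val sample1 = Array(100d, 200d, 300d, 400d)
val sample2 = Array(101d, 205d, 300d, 400d)

// Distribute each sample as an RDD; sc is an existing SparkContext
val rdd1 = sc.parallelize(sample1)
val rdd2 = sc.parallelize(sample2)

// Run the test with a third argument of 0.05 (presumably the significance level)
new TwoSampleIndependentTTest().tTest(rdd1, rdd2, 0.05)
// Run the test without that argument
new TwoSampleIndependentTTest().tTest(rdd1, rdd2)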
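
As a sanity check on the example data, the two-sample statistic can be computed locally without Spark. The sketch below uses Welch's unequal-variance form. That choice is an assumption, since the README does not say whether the library pools variances. The names `WelchSketch`, `meanAndVariance`, and `welch` are hypothetical and not part of the library.

```scala
// Hypothetical helper object; not part of Spark.statistics.
object WelchSketch {
  // Mean and unbiased sample variance of a sample with at least two values.
  def meanAndVariance(xs: Seq[Double]): (Double, Double) = {
    require(xs.size >= 2, "need at least two observations")
    val m = xs.sum / xs.size
    val v = xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
    (m, v)
  }

  // Welch's t statistic and the Welch-Satterthwaite degrees of freedom.
  def welch(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    val (m1, v1) = meanAndVariance(xs)
    val (m2, v2) = meanAndVariance(ys)
    val a = v1 / xs.size // squared standard error contributed by the first sample
    val b = v2 / ys.size // squared standard error contributed by the second sample
    val t = (m1 - m2) / math.sqrt(a + b)
    val df = (a + b) * (a + b) / (a * a / (xs.size - 1) + b * b / (ys.size - 1))
    (t, df)
  }

  // Same values as the README example, held in local collections.
  val sample1 = Seq(100d, 200d, 300d, 400d)
  val sample2 = Seq(101d, 205d, 300d, 400d)
  val (t, df) = welch(sample1, sample2)
}
```

For these samples the statistic is about -0.0165 with just under 6 degrees of freedom, so the means are nowhere near significantly different at the 0.05 level. Turning the statistic into a p-value needs a t-distribution CDF, which the Scala standard library does not provide. One option, assuming Apache Commons Math is on the classpath, is its `TDistribution` class.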

This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.