Name: rsparkling
Owner: H2O.ai
Description: RSparkling: Use H2O Sparkling Water from R (Spark + R + Machine Learning)
Created: 2016-12-07 00:14:36.0
Updated: 2018-05-23 21:01:15.0
Pushed: 2018-05-22 20:48:18.0
Homepage: http://spark.rstudio.com/guides/h2o/
Size: 365
Language: R
The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water package from H2O. This provides an interface to H2O's high performance, distributed machine learning algorithms on Spark, using R.
This package implements basic functionality (creating an H2OContext, showing the H2O Flow interface, and converting between Spark DataFrames and H2O Frames). The main purpose of this package is to provide a connector between sparklyr and H2O's machine learning algorithms.
The rsparkling package uses sparklyr for Spark job deployment and initialization of Sparkling Water. After that, the user can use the regular h2o R package for modeling.
The rsparkling R package requires the sparklyr and h2o R packages to run, so below we show how to install each of these packages.
We recommend the latest stable version of sparklyr.
install.packages("sparklyr")
The sparklyr package makes it easy to install any particular version of Spark. Before installing h2o and rsparkling, decide which version of Spark you would like to work with, because the remaining installation steps revolve around a particular major version of Spark (2.1, 2.2 or 2.3).
The following command will install Spark 2.3.0:
library(sparklyr)
spark_install(version = "2.3.0")
NOTE: The previous command requires internet access. If you are offline or behind a firewall, do the following instead:
Download Spark (pick the major version that corresponds to your Sparkling Water version)
Unzip the Spark files
Set the SPARK_HOME environment variable to the location of the unzipped Spark folder, in R as follows:
Sys.setenv(SPARK_HOME = "/path/to/spark")
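The offline steps above can be sketched in base R. The helper name `set_spark_home` and the placeholder folder are assumptions made for illustration; they are not part of sparklyr or rsparkling. The sketch only checks that the folder exists before setting the variable, so a typo in the path fails early rather than at connection time.

```r
# Hypothetical helper (not part of sparklyr): validate a path, then set SPARK_HOME.
set_spark_home <- function(path) {
  if (!dir.exists(path)) {
    stop("Spark directory not found: ", path)
  }
  # Store an absolute path so later working-directory changes do not matter.
  Sys.setenv(SPARK_HOME = normalizePath(path))
  invisible(Sys.getenv("SPARK_HOME"))
}

# A temporary folder stands in for an unzipped Spark download.
placeholder_spark <- file.path(tempdir(), "spark-placeholder")
dir.create(placeholder_spark, showWarnings = FALSE)
set_spark_home(placeholder_spark)
Sys.getenv("SPARK_HOME")
```

In real use, replace the placeholder folder with the location of the unzipped Spark distribution.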
rsparkling currently requires that a certain version of H2O be used, depending on which major version of Spark is used, although this requirement will be relaxed in a future version. Each release of Sparkling Water is built from a specific version of H2O, and those versions are listed in the table below.
rsparkling will automatically use the latest Sparkling Water based on the major Spark version provided.
Advanced users may want to choose a particular Sparkling Water / H2O version (specific Sparkling Water versions must match specific Spark and H2O versions). Refer to the integration table below.
| Spark Version | Sparkling Water Version | H2O Version | H2O Release Name | H2O Release Patch Number |
| ------------- | ----------------------- | ----------- | ---------------- | ------------------------ |
| 2.3.* | 2.3.5 | 3.18.0.10 | "rel-wolpert" | "10" |
| | 2.3.4 | 3.18.0.9 | "rel-wolpert" | "9" |
| | 2.3.3 | 3.18.0.9 | "rel-wolpert" | "9" |
| | 2.3.2 | 3.18.0.8 | "rel-wolpert" | "8" |
| | 2.3.1 | 3.18.0.7 | "rel-wolpert" | "7" |
| | 2.3.0 | 3.18.0.5 | "rel-wolpert" | "5" |
| 2.2.* | 2.2.16 | 3.18.0.10 | "rel-wolpert" | "10" |
| | 2.2.15 | 3.18.0.9 | "rel-wolpert" | "9" |
| | 2.2.14 | 3.18.0.9 | "rel-wolpert" | "9" |
| | 2.2.13 | 3.18.0.8 | "rel-wolpert" | "8" |
| | 2.2.12 | 3.18.0.7 | "rel-wolpert" | "7" |
| | 2.2.11 | 3.18.0.5 | "rel-wolpert" | "5" |
| | 2.2.10 | 3.18.0.4 | "rel-wolpert" | "4" |
| | 2.2.9 | 3.18.0.2 | "rel-wolpert" | "2" |
| | 2.2.8 | 3.18.0.1 | "rel-wolpert" | "1" |
| | 2.2.7 | 3.16.0.4 | "rel-wheeler" | "4" |
| | 2.2.6 | 3.16.0.2 | "rel-wheeler" | "2" |
| | 2.2.5 | 3.16.0.2 | "rel-wheeler" | "2" |
| | 2.2.4 | 3.16.0.2 | "rel-wheeler" | "2" |
| | 2.2.3 | 3.16.0.1 | "rel-wheeler" | "1" |
| | 2.2.2 | 3.14.0.7 | "rel-weierstrass" | "7" |
| | 2.2.1 | 3.14.0.6 | "rel-weierstrass" | "6" |
| | 2.2.0 | 3.14.0.2 | "rel-weierstrass" | "2" |
| 2.1.* | 2.1.30 | 3.18.0.10 | "rel-wolpert" | "10" |
| | 2.1.29 | 3.18.0.9 | "rel-wolpert" | "9" |
| | 2.1.28 | 3.18.0.9 | "rel-wolpert" | "9" |
| | 2.1.27 | 3.18.0.8 | "rel-wolpert" | "8" |
| | 2.1.26 | 3.18.0.7 | "rel-wolpert" | "7" |
| | 2.1.25 | 3.18.0.5 | "rel-wolpert" | "5" |
| | 2.1.24 | 3.18.0.4 | "rel-wolpert" | "4" |
| | 2.1.23 | 3.18.0.2 | "rel-wolpert" | "2" |
| | 2.1.22 | 3.18.0.1 | "rel-wolpert" | "1" |
| | 2.1.21 | 3.16.0.4 | "rel-wheeler" | "4" |
| | 2.1.20 | 3.16.0.2 | "rel-wheeler" | "2" |
| | 2.1.19 | 3.16.0.2 | "rel-wheeler" | "2" |
| | 2.1.18 | 3.16.0.2 | "rel-wheeler" | "2" |
| | 2.1.17 | 3.16.0.1 | "rel-wheeler" | "1" |
| | 2.1.16 | 3.14.0.7 | "rel-weierstrass" | "7" |
| | 2.1.15 | 3.14.0.6 | "rel-weierstrass" | "6" |
| | 2.1.14 | 3.14.0.2 | "rel-weierstrass" | "2" |
| | 2.1.13 | 3.10.5.4 | "rel-vajda" | "4" |
| | 2.1.12 | 3.10.5.4 | "rel-vajda" | "4" |
| | 2.1.11 | 3.10.5.3 | "rel-vajda" | "3" |
| | 2.1.10 | 3.10.5.2 | "rel-vajda" | "2" |
| | 2.1.9 | 3.10.5.1 | "rel-vajda" | "1" |
| | 2.1.8 | 3.10.4.8 | "rel-ueno" | "8" |
| | 2.1.7 | 3.10.4.7 | "rel-ueno" | "7" |
| | 2.1.6 | 3.10.4.7 | "rel-ueno" | "7" |
| | 2.1.5 | 3.10.4.6 | "rel-ueno" | "6" |
| | 2.1.4 | 3.10.4.5 | "rel-ueno" | "5" |
| | 2.1.3 | 3.10.4.3 | "rel-ueno" | "3" |
| | 2.1.2 | 3.10.4.2 | "rel-ueno" | "2" |
| | 2.1.1 | 3.10.4.2 | "rel-ueno" | "2" |
| | 2.1.0 | 3.10.3.2 | "rel-tverberg" | "2" |
NOTE: A call to h2o_release_table() will display the above table in your R console and return a data.frame containing this information.
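To illustrate how the newest Sparkling Water release for a given Spark major version can be picked out of such a table, here is a minimal base R sketch. It uses a handful of rows copied from the table above; the column names and the helper `latest_for_spark` are assumptions for illustration and may not match what h2o_release_table() actually returns.

```r
# A few rows copied from the integration table; column names are assumed.
releases <- data.frame(
  spark_version = c("2.3", "2.3", "2.2", "2.2", "2.2", "2.1", "2.1"),
  sparkling_water_version = c("2.3.5", "2.3.4", "2.2.16", "2.2.15", "2.2.9",
                              "2.1.30", "2.1.29"),
  h2o_version = c("3.18.0.10", "3.18.0.9", "3.18.0.10", "3.18.0.9", "3.18.0.2",
                  "3.18.0.10", "3.18.0.9"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: newest Sparkling Water row for a Spark major version.
latest_for_spark <- function(tbl, spark_major) {
  rows <- tbl[tbl$spark_version == spark_major, , drop = FALSE]
  if (nrow(rows) == 0) {
    stop("No Sparkling Water release listed for Spark ", spark_major)
  }
  # numeric_version() orders "2.2.16" after "2.2.9", unlike plain string sorting.
  newest <- order(numeric_version(rows$sparkling_water_version), decreasing = TRUE)[1]
  rows[newest, , drop = FALSE]
}

latest_for_spark(releases, "2.2")
```

Comparing with `numeric_version()` matters here because a plain character sort would rank "2.2.9" above "2.2.16".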
To install any one of the above versions, we recommend using the H2O hosted repository on S3. In future versions of rsparkling, all Sparkling Water compatible versions of H2O will be available on CRAN and can then be installed with the versions R package, using a command such as versions::install.versions("h2o", "3.18.0.10").
At present, you can install the h2o R package using a repository URL comprised of the H2O version name and number. Example: http://h2o-release.s3.amazonaws.com/h2o/rel-wolpert/10/R
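As a small sketch of how that URL is put together, the helper below builds it from the release name and patch number listed in the table. The function name `h2o_repo_url` is hypothetical and exists only for illustration.

```r
# Hypothetical helper: build the H2O repository URL from the release name
# and patch number shown in the integration table.
h2o_repo_url <- function(release_name, patch) {
  sprintf("http://h2o-release.s3.amazonaws.com/h2o/%s/%s/R", release_name, patch)
}

h2o_repo_url("rel-wolpert", 10)
```

The result can be passed as the repos argument of install.packages() when installing the h2o package.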
The R code below will install the most recent Spark 2.3 compatible release of H2O, which is “rel-wolpert” patch 10 (aka H2O version 3.18.0.10).
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# Next, we download packages that H2O depends on.
pkgs <- c("methods", "statmod", "stats", "graphics", "RCurl", "jsonlite", "tools", "utils")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
# Now we download and install the H2O package for R.
# In this case we are using rel-wolpert 10 (3.18.0.10).
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-wolpert/10/R")
The latest stable version of rsparkling on CRAN can be installed as follows:
install.packages("rsparkling")
Alternatively, the development version can be installed from the “master” branch as follows:
library(devtools)
devtools::install_github("h2oai/rsparkling", ref = "master")
If a particular version of Sparkling Water is desired or required, you can specify it by calling options(rsparkling.sparklingwater.version = ...), which sets the Sparkling Water version globally.
NOTE: If you do not set rsparkling.sparklingwater.version, then the latest version of Sparkling Water will be used based on the version of Spark installed.
NOTE: If you would like to use a custom Sparkling Water jar, then you need to call the following: options(rsparkling.sparklingwater.location = "path/to/sparkling_water.jar").
This will be the version of Sparkling Water that is used when library(rsparkling) is called, and thus you should set the option before loading the library.
options(rsparkling.sparklingwater.version = "2.3.5") # Using Sparkling Water 2.3.5
library(rsparkling)
NOTE: The previous command requires internet access. If you are offline or behind a firewall, do the following instead:
Download the Sparkling Water jar of your choice based on the integration table above. To do this, go to the following link, where [SW Major Version] is the major version of Sparkling Water you wish to use (e.g., 2.3) and [SW Minor Version] is the minor version (e.g., 5):
http://h2o-release.s3.amazonaws.com/sparkling-water/rel-[SW Major Version]/[SW Minor Version]/index.html
Click the DOWNLOAD SPARKLING WATER tab, which will download a .zip file of Sparkling Water.
Run the following command to unzip the folder:
unzip sparkling-water-[SW Major Version].[SW Minor Version].zip
The path to the Sparkling Water jar file is: sparkling-water-[SW Major Version].[SW Minor Version]/assembly/build/libs/sparkling-water-assembly_*.jar.
The following commands will now use that Sparkling Water jar:
options(rsparkling.sparklingwater.location = "path/to/sparkling-water-assembly_*.jar")
library(rsparkling)
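The offline steps above can be sketched in base R as follows. Both helper names (`sw_download_url`, `find_sw_jar`) are hypothetical, and the placeholder jar created in the temporary folder only stands in for a real download. Whether the rsparkling.sparklingwater.location option expands a `*` wildcard itself is not something this sketch assumes, so it resolves the wildcard to a concrete file path first.

```r
# Hypothetical helper: download page for a given Sparkling Water version.
sw_download_url <- function(major, minor) {
  sprintf("http://h2o-release.s3.amazonaws.com/sparkling-water/rel-%s/%s/index.html",
          major, minor)
}

# Hypothetical helper: resolve the assembly jar inside an unzipped folder.
find_sw_jar <- function(unzipped_dir) {
  pattern <- file.path(unzipped_dir, "assembly", "build", "libs",
                       "sparkling-water-assembly_*.jar")
  jars <- Sys.glob(pattern)
  if (length(jars) != 1) {
    stop("Expected exactly one assembly jar, found ", length(jars))
  }
  jars
}

sw_download_url("2.3", "5")

# Demonstration with a placeholder folder and an empty placeholder jar file.
demo_dir <- file.path(tempdir(), "sparkling-water-2.3.5")
libs_dir <- file.path(demo_dir, "assembly", "build", "libs")
dir.create(libs_dir, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(libs_dir, "sparkling-water-assembly_placeholder.jar"))
jar <- find_sw_jar(demo_dir)
jar
```

With a real download, the resolved path in `jar` would be passed to options(rsparkling.sparklingwater.location = jar) before calling library(rsparkling).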
Once we've installed rsparkling and its dependencies, the first step is to create a Spark connection as follows:
sc <- spark_connect(master = "local", version = "2.3.0")
NOTE: Please be sure to set version in spark_connect() to the Spark version used by your version of Sparkling Water.
NOTE: The previous command requires internet access. If you are offline or behind a firewall, do the following instead:
Download Spark (pick the major version that corresponds to your Sparkling Water version)
Unzip the Spark files
Set the SPARK_HOME environment variable to the location of the unzipped Spark folder, in R as follows:
Sys.setenv(SPARK_HOME = "/path/to/spark")
Note that the spark_home parameter in spark_connect defaults to the SPARK_HOME environment variable. If SPARK_HOME is defined, it will always be used unless the version parameter is specified to force the use of a locally installed version.
Run the following to create a Spark connection using the default IP and port:
sc <- spark_connect(master = "local")
RSparkling does not expose setters and getters for specifying configuration options. You must specify the Spark configuration options directly, for example:
config <- spark_config()
config <- c(config, list("spark.ext.h2o.node.port.base" = "55555", "spark.ext.h2o.client.port.base" = "44444"))
sc <- spark_connect(master = "yarn-client", app_name = "sparklyr", config = config)
In the above, spark.ext.h2o.node.port.base affects the worker nodes, and spark.ext.h2o.client.port.base affects the client.
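Since the configuration is an ordinary named list, the same two options can also be set by name. The sketch below uses a hypothetical helper, `with_h2o_ports`, and a plain `list()` as a stand-in for the result of spark_config(), so that it runs without sparklyr installed; the port numbers are the ones from the example above.

```r
# Hypothetical helper: set the two H2O port options on a config list.
with_h2o_ports <- function(config, node_port, client_port) {
  # Values are stored as strings, matching the example above.
  config[["spark.ext.h2o.node.port.base"]] <- as.character(as.integer(node_port))
  config[["spark.ext.h2o.client.port.base"]] <- as.character(as.integer(client_port))
  config
}

# A plain list stands in for spark_config() in this sketch.
config <- with_h2o_ports(list(), node_port = 55555, client_port = 44444)
str(config)
```

Assigning by name overwrites an existing entry instead of appending a second one with the same key, which can happen when lists are combined with c().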
The call to library(rsparkling) automatically registers the Sparkling Water extension, which in turn specifies that the Sparkling Water Spark package should be made available for Spark connections. Let's inspect the H2OContext for our Spark connection:
h2o_context(sc)
## <jobj[6]>
## class org.apache.spark.h2o.H2OContext
##
## Sparkling Water Context:
## * H2O name: sparkling-water-jjallaire_-1482215501
## * number of executors: 1
## * list of used executors:
## (executorId, host, port)
## ------------------------
## (driver,localhost,54323)
## ------------------------
##
## Open H2O Flow in browser: http://127.0.0.1:54323 (CMD + click in Mac OSX)
##
We can also view the H2O Flow web UI:
h2o_flow(sc)
As an example, let's copy the mtcars dataset to Spark so we can access it from H2O Sparkling Water:
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl
## Source: query [?? x 11]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
##
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## # ... with more rows
The use case we'd like to enable is calling the H2O algorithms and feature transformers directly on Spark DataFrames that we've manipulated with dplyr. This is indeed supported by the Sparkling Water package. Here is how you convert a Spark DataFrame into an H2O Frame:
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl)
mtcars_hf
## <jobj[103]>
## class water.fvec.H2OFrame
## Frame frame_rdd_39 (32 rows and 11 cols):
## mpg cyl disp hp drat wt qsec vs am gear carb
## min 10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1
## mean 20.090625 6 230.721875 146 3.5965625 3.21725 17.848750000000003 0 0 3 2
## stddev 6.026948052089104 1 123.93869383138194 68 0.5346787360709715 0.9784574429896966 1.7869432360968436 0 0 0 1
## max 33.9 8 472.0 335 4.93 5.424 22.9 1 1 5 8
## missing 0.0 0 0.0 0 0.0 0.0 0.0 0 0 0 0
## 0 21.0 6 160.0 110 3.9 2.62 16.46 0 1 4 4
## 1 21.0 6 160.0 110 3.9 2.875 17.02 0 1 4 4
## 2 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
## 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
## 5 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
## 6 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## 7 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
## 8 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
## 9 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
## 10 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
## 11 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
## 12 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
## 13 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3
## 14 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
## 15 10.4 8 460.0 215 3.0 5.424 17.82 0 0 3 4
## 16 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 17 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
## 18 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 19 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
Using the same mtcars dataset, here is an example where we train a Gradient Boosting Machine (GBM) to predict “mpg”.
First, we do a library call to h2o:
library(h2o)
Define the response, y, and set of predictor variables, x:
y <- "mpg"
x <- setdiff(names(mtcars_hf), y)
Let's split the data into a train and test set using H2O. The h2o.splitFrame function defaults to a 75-25 split (ratios = 0.75), but here we will make a 70-30 train-test split:
# Split the mtcars H2O Frame into train & test sets
splits <- h2o.splitFrame(mtcars_hf, ratios = 0.7, seed = 1)
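For readers who want to see what a ratio-based split amounts to, here is a base R analogue on the local mtcars data. It is only an illustration of the idea and not H2O's implementation: each row is assigned to the training set with probability 0.7, so the resulting proportions are approximate, and the selected rows will differ from those chosen by h2o.splitFrame even with the same seed.

```r
# Base R analogue of a 70-30 random split (not H2O's implementation).
set.seed(1)

# Draw one uniform number per row; rows below the ratio go to the training set.
in_train <- runif(nrow(mtcars)) < 0.7
train <- mtcars[in_train, ]
test  <- mtcars[!in_train, ]

# Row counts of the two parts; together they cover all 32 rows of mtcars.
c(train = nrow(train), test = nrow(test))
```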
Now train an H2O GBM using the training H2OFrame.
fit <- h2o.gbm(x = x,
               y = y,
               training_frame = splits[[1]],
               min_rows = 1,
               seed = 1)
print(fit)
Model Details:
==============
H2ORegressionModel: gbm
Model ID: GBM_model_R_1474763476171_1
Model Summary:
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50               14807         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000         17         21    18.64000
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 0.001211724
RMSE: 0.03480983
MAE: 0.02761402
RMSLE: 0.001929304
Mean Residual Deviance : 0.001211724
We can assess the GBM by measuring its performance on the test set.
perf <- h2o.performance(fit, newdata = splits[[2]])
print(perf)
H2ORegressionMetrics: gbm
MSE: 2.707001
RMSE: 1.645297
MAE: 1.455267
RMSLE: 0.08579109
Mean Residual Deviance : 2.707001
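To make the reported regression metrics concrete, the following base R sketch computes them from a pair of vectors. The function names and the sample values are made up for illustration, and the RMSLE definition assumes the log(1 + x) convention. The last lines check that the reported RMSE values are the square roots of the reported MSE values.

```r
# Standard regression metrics, written out in base R for illustration.
mse   <- function(actual, predicted) mean((actual - predicted)^2)
rmse  <- function(actual, predicted) sqrt(mse(actual, predicted))
mae   <- function(actual, predicted) mean(abs(actual - predicted))
# Assumes the log(1 + x) convention for RMSLE.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

# Made-up values, for illustration only.
actual    <- c(21.0, 22.8, 18.7)
predicted <- c(20.0, 23.8, 18.7)
c(MSE = mse(actual, predicted), RMSE = rmse(actual, predicted),
  MAE = mae(actual, predicted), RMSLE = rmsle(actual, predicted))

# Consistency check against the numbers reported above: RMSE = sqrt(MSE).
sqrt(0.001211724)  # compare with the training RMSE of 0.03480983
sqrt(2.707001)     # compare with the test RMSE of 1.645297
```

The large gap between the training and test errors in the output above is what one would expect from a model fitted with min_rows = 1 on such a small dataset.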
To generate predictions on a test set, do the following. This returns an H2OFrame with one or more columns of predicted values: a single column for regression, 3 columns for binary classification, and C+1 columns for multi-class classification (where C is the number of classes).
pred_hf <- h2o.predict(fit, newdata = splits[[2]])
head(pred_hf)
predict
.39512
.92804
.19558
.47695
.47695
.24433
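The column counts described before the prediction call follow one rule: a regression model returns a single column, while a classifier returns the predicted label plus one probability column per class. The helper below, `n_prediction_columns`, is a hypothetical name used only to spell that rule out.

```r
# Hypothetical helper: expected number of columns in the prediction frame.
n_prediction_columns <- function(type = c("regression", "binomial", "multinomial"),
                                 n_classes = NULL) {
  type <- match.arg(type)
  switch(type,
    regression = 1L,
    # Two classes: the predicted label plus two class probabilities.
    binomial = 3L,
    multinomial = {
      if (is.null(n_classes)) {
        stop("n_classes is required for multinomial models")
      }
      as.integer(n_classes) + 1L
    }
  )
}

n_prediction_columns("regression")
n_prediction_columns("binomial")
n_prediction_columns("multinomial", n_classes = 4)
```

The binomial case is simply C+1 with C = 2.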
Now let's say you want to make this H2OFrame available to Spark. You can convert an H2OFrame into a Spark DataFrame using the as_spark_dataframe function:
pred_sdf <- as_spark_dataframe(sc, pred_hf)
head(pred_sdf)
Source: query [?? x 1]
Database: spark connection master=local[8] app=sparklyr local=TRUE
predict
<dbl>
.39512
.92804
.19558
.47695
.47695
.24433
If you are new to H2O for machine learning, we recommend you start with the Intro to H2O Tutorial, followed by the H2O Grid Search & Model Selection Tutorial. There are a number of other H2O R tutorials and demos available, as well as the H2O World 2015 Training Gitbook, and the Machine Learning with R and H2O Booklet (pdf).
Look at the Spark log from R:
spark_log(sc, n = 100)
Now we disconnect from Spark. This also stops the H2OContext, since it is owned by the Spark shell process used by our Spark connection:
spark_disconnect(sc)