Name: r4ml-on-watson-studio
Owner: International Business Machines
Description: null
Created: 2018-05-14 15:30:40.0
Updated: 2018-05-14 22:07:20.0
Pushed: 2018-02-07 05:18:57.0
Homepage: null
Size: 720
Language: Jupyter Notebook
In this developer journey we will use R4ML, a scalable R package running on IBM Data Science Experience (DSX), to perform various machine learning exercises. For those unfamiliar with it, DSX is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark, etc.) to collaborate, share, and gather insight from their data.
When the reader has completed this journey, they will understand how to:

* Use R4ML on DSX to perform scalable data preprocessing and feature engineering
* Sample a large dataset and explore it with R's ggplot2 library
* Apply PCA for dimension reduction
* Create and run Jupyter Notebooks in DSX
The intended audience of this code pattern is data scientists who wish to perform scalable feature engineering and data exploration. R4ML provides various out-of-the-box tools and preprocessing utilities for feature engineering, as well as a sampling utility for drawing samples of the data for exploratory analysis. This code pattern provides an end-to-end example demonstrating the ease and power of R4ML in implementing data preprocessing and data exploration. For more information about additional functionality, support, documentation, and the roadmap, please visit R4ML.
We use the Airline On-Time Statistics and Delay Causes data from RITA: a 1% sample of the "airline" dataset available at http://stat-computing.org/dataexpo/2009/the-data.html. This data originally comes from RITA (http://www.rita.dot.gov) and is in the public domain.

For this example, we use a subset of the above dataset, which is shipped with R4ML. Users can also use the bigger dataset from RITA, and our code will work with that as well.
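The bundled subset keeps the notebooks fast; to run against a full RITA year file instead, you could load it yourself. A minimal sketch, assuming a downloaded and unzipped file (e.g. `2008.csv` from the dataexpo link above) and that `as.r4ml.frame` accepts a SparkR `SparkDataFrame` — check the R4ML documentation for the authoritative calls:

```R
library(R4ML)
library(SparkR)

r4ml.session()  # starts SparkR and SystemML under the covers

# Read a full-year RITA file with SparkR (the path is an example)
airline_full <- read.df("2008.csv", source = "csv",
                        header = "true", inferSchema = "true")

# Convert to a distributed r4ml.frame; the rest of the
# analysis code then works the same as with the bundled sample
air_df <- as.r4ml.frame(airline_full)
```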
* R4ML is a git-downloadable open-source R package from IBM
* Created on top of SparkR and Apache SystemML (so it supports features from both)
* Acts as an R bridge between SparkR and Apache SystemML
* Provides a collection of canned algorithms
* Provides the ability to create custom ML algorithms
* Provides both SparkR and Apache SystemML functionality
* APIs are friendlier to the R user
* We will first load the package and data, then perform initial transformations and various feature engineering steps.
* We will sample the dataset and use R's powerful ggplot2 library for various exploratory analyses.
* In the end, we will run PCA to reduce the dimensionality of the dataset and select the k components that cover 90% of the variance.

More details are in the notebooks.
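Sketched in R, the end-to-end flow looks roughly like the following. This is a minimal sketch, not the notebooks' exact code: the API names (`r4ml.session`, `as.r4ml.frame`, `r4ml.ml.preprocess`, `r4ml.sample`, `r4ml.pca`), their parameters, and the bundled `airline` sample data are assumptions based on R4ML's documentation, so consult the notebooks for the authoritative calls:

```R
library(R4ML)

# Start the R4ML session; this brings up SparkR and SystemML underneath
r4ml.session()

# Load the bundled 1% airline sample as a distributed r4ml.frame
air_df <- as.r4ml.frame(airline)

# Feature engineering: recode categorical columns, impute missing values
# (argument names here are illustrative -- see the R4ML docs)
prep <- r4ml.ml.preprocess(
  air_df,
  transformPath = "/tmp/airline.transform",
  recodeAttrs   = c("Origin", "Dest")
)

# Pull a small random sample back to base R for ggplot2 exploration
sample_df <- as.data.frame(r4ml.sample(prep$data, perc = 0.1)[[1]])

# Dimension reduction: PCA, keeping enough components for ~90% variance
pca <- r4ml.pca(as.r4ml.matrix(prep$data), k = 5)

r4ml.session.stop()
```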
Follow these steps to set up and run this developer journey. These steps are described in detail below.
Sign up for IBM's Data Science Experience. By signing up for the Data Science Experience, two services will be created in your Bluemix account: `DSX-Spark` and `DSX-ObjectStore`. If these services do not exist, or if you are already using them for some other application, you will need to create new instances.
To create these services:

* Name your Spark service `DSX-Spark` so that you can keep track of it.
* Name your Object Storage service `DSX-ObjectStorage` so that you can keep track of it.

Note: When creating your Object Storage service, select the `Swift` storage type in order to avoid having to pay an upgrade fee.

Take note of your service names as you will need to select them in the following steps.
First you must create a new Project:

1. Click on the `Get Started` tab at the top or scroll down to `Recently updated projects`.
2. Click on `New project` under `Recently updated projects`.
3. Enter a `Name` and optional `Description`.
4. For `Spark Service`, select your Apache Spark service name.
5. For `Storage Type`, select the `Object Storage (Swift API)` option.
6. For `Target Object Storage Instance`, select your Object Storage service name.
7. Click `Create`.
Create Notebook 1:

1. Click on `add notebooks`.
2. Click the tab for `From URL` and enter a `Name` and optional `Description`.
3. For `Notebook URL` enter: https://github.com/aloknsingh/ibm_r4ml_biganalytics/notebooks/R4ML_Introduction_Exploratory_DataAnalysis.ipynb
4. For `Spark Service`, select your Apache Spark service name.
5. Click `Create Notebook`.
Create Notebook 2:

1. Click on `add notebooks`.
2. Click the tab for `From URL` and enter a `Name` and optional `Description`.
3. For `Notebook URL` enter: https://github.com/aloknsingh/ibm_r4ml_biganalytics/notebooks/R4ML_Data_Preprocessing_and_Dimension_Reduction.ipynb
4. For `Spark Service`, select your Apache Spark service name.
5. Click `Create Notebook`.
When a notebook is executed, what is actually happening is that each code cell in the notebook is executed, in order, from top to bottom.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is `In [x]:`. Depending on the state of the notebook, the `x` can be:

* A blank, which indicates that the cell has never been executed.
* A number, which represents the relative order in which this code step was executed.
* A `*`, which indicates that the cell is currently executing.

There are several ways to execute the code cells in your notebook:
* Select a cell, and then press the `Play` button in the toolbar.
* From the `Cell` menu bar, there are several options available. For example, you can `Run All` cells in your notebook, or you can `Run All Below`, which will start executing from the first cell under the currently selected cell, and then continue executing all cells that follow.
* Press the `Schedule` button located in the top right section of your notebook panel. Here you can schedule your notebook to be executed once at some future time, or repeatedly at your specified interval.

Under the `File` menu, there are several ways to save your notebook:
* `Save` will simply save the current state of your notebook, without any version information.
* `Save Version` will save the current state of your notebook with a version tag that contains a date and time stamp. Up to 10 versions of your notebook can be saved, each one retrievable by selecting the `Revert To Version` menu item.

You can share your notebook by selecting the `Share` button located in the top right section of your notebook panel. The end result of this action will be a URL link that will display a "read-only" version of your notebook. You have several options to specify exactly what you want shared from your notebook:
* `Only text and output`: will remove all code cells from the notebook view.
* `All content excluding sensitive code cells`: will remove any code cells that contain a sensitive tag. For example, `# @hidden_cell` is used to protect your dashDB credentials from being shared.
* `All content, including code`: displays the notebook as is.

A variety of `download as` options are also available in the menu.
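As an illustration of the sensitive tag mentioned above, a credentials cell might look like the following (the variable name and fields are placeholders, not the notebooks' actual code):

```R
# @hidden_cell
# This cell is dropped when sharing with
# "All content excluding sensitive code cells".
credentials <- list(
  username = "<your-username>",
  password = "<your-password>"
)
```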