IBM/r4ml-on-watson-studio

Name: r4ml-on-watson-studio

Owner: International Business Machines

Description: null

Created: 2018-05-14 15:30:40.0

Updated: 2018-05-14 22:07:20.0

Pushed: 2018-02-07 05:18:57.0

Homepage: null

Size: 720

Language: Jupyter Notebook

README

Big Data Preparation and Exploration using R4ML

In this developer journey we will use R4ML, a scalable R package, running on IBM Data Science Experience (DSX) to perform various machine learning exercises. For users unfamiliar with it, DSX is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark) to collaborate, share, and gather insight from their data.

When the reader has completed this journey, they will understand how to:

The intended audience of this code pattern is data scientists who want to do scalable feature engineering and data exploration. R4ML provides various out-of-the-box tools and preprocessing utilities for feature engineering. It also provides a sampling utility for sampling the data and doing exploratory analysis. This code pattern provides an end-to-end example that demonstrates the ease and power of R4ML in implementing data preprocessing and data exploration. For more information about additional functionality, documentation, and the roadmap, please visit R4ML.
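To make the sample-then-explore idea concrete, here is a minimal sketch in plain Python. It does not use any R4ML or Spark API; the helper names `sample_rows` and `summarize`, the 10% sampling fraction, and the synthetic `delay` column are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def sample_rows(rows, fraction, seed=0):
    """Bernoulli sampling: keep each row independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def summarize(rows, column):
    """Basic summary statistics for one numeric column, skipping missing values."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Synthetic stand-in for a large dataset (an assumption, not the pattern's data).
data_rng = random.Random(42)
full_data = [{"id": i, "delay": data_rng.gauss(10.0, 3.0)} for i in range(100_000)]

# Explore a 10% sample instead of the full data.
sample = sample_rows(full_data, fraction=0.1, seed=1)
full_stats = summarize(full_data, "delay")
sample_stats = summarize(sample, "delay")

print("rows in sample:", sample_stats["count"])
print("mean on full data:", round(full_stats["mean"], 3))
print("mean on sample:", round(sample_stats["mean"], 3))
```

The sample is roughly a tenth of the data, yet its summary statistics stay close to those of the full dataset, which is what makes exploratory analysis on a sample practical.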

Source of data
Flow

  1. Load the provided notebook onto the IBM Data Science Experience platform.
  2. The notebook interacts with an Apache Spark instance.
  3. A sample big data dataset is loaded into the Jupyter Notebook.
  4. R4ML runs atop Apache Spark to perform data preprocessing and exploratory analysis.

What problem does it solve for developers?

  1. Large-scale exploratory analytics and data preparation.
  2. Dimensionality reduction.
  3. Using your favorite R utilities on big data.
  4. A walkthrough of the steps necessary to complete data preparation and exploration.
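As an illustration of what dimensionality reduction means in practice, the sketch below projects two correlated columns onto a single principal direction. It uses principal component analysis via power iteration as one common technique, which is not necessarily the method the notebook uses, and it calls no R4ML function. All function names and the synthetic data are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Approximate the top eigenvector of `cov` by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, v):
    """Reduce each row to one number: its coordinate along direction `v`."""
    return [sum(a * b for a, b in zip(row, v)) for row in centered]


# Synthetic data: the second column is roughly twice the first, plus small noise.
rng = random.Random(7)
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
rows = [[x, 2.0 * x + rng.gauss(0.0, 0.1)] for x in xs]

centered = center(rows)
cov = covariance(centered)
v = first_component(cov)
reduced = project(centered, v)

# Share of the total variance kept by the single retained dimension.
kept = sum(z * z for z in reduced) / (len(reduced) - 1)
explained = kept / (cov[0][0] + cov[1][1])
print("principal direction:", [round(x, 3) for x in v])
print("variance explained:", round(explained, 4))
```

Because the two columns are almost perfectly correlated, one dimension retains nearly all of the variance, so the second column adds little information.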

Included Components

Featured Technologies
Analysis Section:
Scalable R4ML Key Features: Content

Steps

Follow these steps to set up and run this developer journey. Each step is described in detail below.

  1. Sign up for the Data Science Experience
  2. Create the notebook
  3. Run the notebook
  4. Save and share

1. Sign up for the Data Science Experience

Sign up for IBM's Data Science Experience. Signing up creates two services in your Bluemix account: DSX-Spark and DSX-ObjectStore. If these services do not exist, or if you are already using them for another application, you will need to create new instances.

To create these services:

Note: When creating your Object Storage service, select the Swift storage type in order to avoid having to pay an upgrade fee.

Take note of your service names as you will need to select them in the following steps.

2. Create the notebook

First you must create a new Project:

Create Notebook 1:

Create Notebook 2:

3. Run the notebook

Executing a notebook runs each of its code cells in order, from top to bottom.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:
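The execution model and margin tags described above can be sketched in a few lines of Python. This is a toy model, not Jupyter's implementation: `run_notebook` is a hypothetical helper that runs cell sources in order in one shared namespace. It follows the standard Jupyter convention that a blank tag means the cell has never been executed, an asterisk means it is currently executing, and a number gives the order in which it was executed.

```python
def run_notebook(cells):
    """Execute cell sources top to bottom in one shared namespace.

    Returns the final margin tag of every cell and the resulting namespace.
    """
    namespace = {}
    tags = ["In [ ]:" for _ in cells]  # blank: never executed
    counter = 0
    for index, source in enumerate(cells):
        tags[index] = "In [*]:"  # asterisk: currently executing
        exec(source, namespace)  # later cells see names defined by earlier ones
        counter += 1
        tags[index] = f"In [{counter}]:"  # number: order of execution
    return tags, namespace


cells = ["x = 2", "y = x * 21", "result = y + 0"]
tags, ns = run_notebook(cells)
for tag, source in zip(tags, cells):
    print(tag, source)
```

The second cell only works because the first one has already run, which is why running cells out of order in a real notebook can produce errors or stale results.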

There are several ways to execute the code cells in your notebook:

4. Save and share

How to save your work:

Under the File menu, there are several ways to save your notebook:

How to share your work:

You can share your notebook by selecting the "Share" button in the top right section of your notebook panel. This produces a URL that displays a "read-only" version of your notebook. You have several options to specify exactly what you want shared from your notebook:

License

Apache 2.0


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.