bigdatagenomics/mango-notebook

Name: mango-notebook

Owner: Big Data Genomics

Description: Notebook tools for Big Data Genomics. Apache 2 licensed.

Created: 2015-01-28 22:09:42.0

Updated: 2017-03-17 10:03:33.0

Pushed: 2015-03-01 08:29:02.0

Homepage:

Size: 5724

Language: JavaScript


README

Spark Notebook


A fork of the amazing scala-notebook, focusing on massive dataset analysis using Apache Spark.

Description

The main intent of this tool is to create reproducible analysis using Scala, Apache Spark and more.

This is achieved through an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner.

Spark support comes out of the box and is enabled simply through the implicit variable named sparkContext.
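For instance, a cell can use it directly; a minimal sketch, assuming it runs inside a live notebook where sparkContext is already bound:

```scala
// sparkContext is provided implicitly by the notebook runtime.
val rdd = sparkContext.parallelize(1 to 100)
val total = rdd.reduce(_ + _)  // sums the range: 5050
```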

Launch
Using a release

Long story short, there are several ways to start the Spark Notebook quickly (even from scratch).

Note, however, that these distributions come in several flavors depending on the Spark and Hadoop versions you are using.

ZIP

The zip distributions are publicly available in the bucket: s3://spark-notebook.

Check out the needed version here.

Here is an example of how to use it:

wget https://s3.eu-central-1.amazonaws.com/spark-notebook/zip/spark-notebook-0.1.4-spark-1.2.0-hadoop-1.0.4.zip
unzip spark-notebook-0.1.4-spark-1.2.0-hadoop-1.0.4.zip
cd spark-notebook-0.1.4-spark-1.2.0-hadoop-1.0.4
bin/spark-notebook
Docker

If you're a Docker user, the following procedure will be even simpler!

Check out the needed version here.

docker pull andypetrella/spark-notebook:0.1.4-spark-1.2.0-hadoop-1.0.4
docker run -p 9000:9000 andypetrella/spark-notebook:0.1.4-spark-1.2.0-hadoop-1.0.4
DEB

Debian packages are a standard way to distribute software, hence the Spark Notebook is also available in this form (from v0.1.4):

wget https://s3.eu-central-1.amazonaws.com/spark-notebook/deb/spark-notebook-0.1.4-spark-1.2.0-hadoop-1.0.4_all.deb
sudo dpkg -i spark-notebook-0.1.4-spark-1.2.0-hadoop-1.0.4_all.deb
sudo spark-notebook

Check out the needed version here.

From the sources

The Spark Notebook requires a Java™ environment (i.e. a JVM) as runtime and Play 2.2.6 to build it.

Of course, you will also need a working Git installation to download and build the code.

Procedure

Download the code:

git clone https://github.com/andypetrella/spark-notebook.git
cd spark-notebook
Launch the server

Enter the Play console by running play within the spark-notebook folder:

[info] Loading global plugins from /home/noootsab/.sbt/0.13/plugins
[info] Loading project definition from /home/Sources/noootsab/spark-notebook/project
[info] Set current project to spark-notebook (in build file:/home/Sources/noootsab/spark-notebook/)
(Play ASCII-art banner)

play 2.2.6 built with Scala 2.10.3 (running Java 1.7.0_72), http://www.playframework.com

Type "help play" or "license" for more information.
Type "exit" or use Ctrl+D to leave this console.

[spark-notebook] $

To create your distribution

[spark-notebook] $ dist

In order to develop on the Spark Notebook, you'll have to use the run command instead.

Use

Once the server has started, head to http://localhost:9000 and you'll see something similar to: Notebook list

From there you can either create a new notebook or open an existing one.

In both cases, the notebook will open in a new browser tab, loaded as a web page.

Note: a notebook is a JSON file containing the layout and analysis blocks, located within the project folder (with the .snb extension). Hence notebooks can be shared, and their history can be tracked in an SCM like Git.

Features
Use/Reconfigure Spark

Since this project directly targets Spark usage, a SparkContext is added to the environment and can be used without additional effort.

Example using Spark

Spark will start with a regular/basic configuration. There are different ways to customize the embedded Spark to your needs.

Use the form

In order to adapt the configuration of the SparkContext, one can use the widget notebook.front.widgets.Spark. This widget takes the current context as its only argument and produces an HTML form that allows manual, friendly changes to be applied.

So first, add the widget in a cell:

import notebook.front.widgets.Spark
new Spark(sparkContext)

Run the cell and you'll get the configuration form.

It has two parts:

Submit the first part and the SparkContext will restart in the background (you can verify in the Spark UI if you like).

The reset function

The function reset is available in all notebooks. It takes several parameters, but the most important one is lastChanges, which is itself a function that can adapt the SparkConf. This way, we can change the master, the executor memory, add a Cassandra sink, or whatever else before restarting the context. For more options, see the Spark Configuration documentation.

In this example we reset SparkContext and add configuration options to use the [cassandra-connector]:

import org.apache.spark.{Logging, SparkConf}
val cassandraHost: String = "localhost"
reset(lastChanges = _.set("spark.cassandra.connection.host", cassandraHost))

This makes the Cassandra connector available in the SparkContext. Then you can use it, like so:

import com.datastax.spark.connector._
sparkContext.cassandraTable("test_keyspace", "test_column_family")
Keep an eye on your tasks

Accessing the Spark UI is not always allowed or easy, hence a simple widget is available to keep an eye on the stages running on the Spark cluster.

Luckily, it's fairly easy, just add this to the notebook:

import org.apache.spark.ui.notebook.front.widgets.SparkInfo
import scala.concurrent.duration._
new SparkInfo(sparkContext, checkInterval = 1 second, execNumber = Some(100))

This call will show and update a feedback panel tracking some basic (for now) metrics. In this configuration there will be one check per second, for at most 100 checks.

This can be tuned at will; for instance, for infinite checking, one can pass None as the execNumber argument.

Counting the words of a Wikipedia dump will result in: Showing progress

Using (Spark)SQL

Spark comes with a handy feature: we can write SQL queries rather than the equivalent Scala boilerplate, with the clear advantage that the resulting DAG is optimized.

The Spark Notebook offers SparkSQL support. To access it, we first need to register an RDD as a table:

myRDD.registerTempTable("data")
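As a fuller sketch for the Spark 1.x SparkSQL API this README targets (the Person case class and sample data are illustrative, not from the original):

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sparkContext)
import sqlContext.createSchemaRDD  // implicit conversion: RDD[Person] -> SchemaRDD

// Build a small RDD and register it so SQL cells can query the "people" table.
val people = sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
people.registerTempTable("people")
```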

Now, we can play with SQL in two different ways, the static and the dynamic ones.

Static SQL

Then we can play with this data table like so:

:sql select col1 from data where col2 == 'thingy'

This will give access to the result via the resXYZ variable.

This is already helpful, but the resXYZ numbering can change and is not friendly, so we can also give a name to the result:

:sql [col1Var] select col1 from data where col2 == 'thingy'

Now, we can use the variable col1Var wrapping a SchemaRDD.

This variable is reactive, meaning that it reacts to changes in the SQL result. Hence, in order to deal with the result, you can access its react function, which takes two arguments:

The power of this reactivity is increased when we use SQL with dynamic parts.

Dynamic SQL

A dynamic SQL looks like a static SQL, but with specific tokens in it. Such tokens take the form {type: variableName}.

When executing the command, the notebook will produce a form by generating an input for each dynamic part. See the showcase below.

An example of such dynamic SQL is

:sql [selectKids] SELECT name FROM people WHERE name = "{String: name}" and age >= {Int: age}

This will create a form with two inputs, one text and one number.

When changing the value in the inputs, the SQL is compiled on the server and the result is printed on the notebook (Success, Failure, Bad Plan, etc.).

Again, the result is completely reactive; hence using the react function is mandatory to use the underlying SchemaRDD (once it becomes valid!).

Show case

This is how it looks in the notebook:

Using SQL

Interacting with JavaScript

Showing numbers can be good, but great analysis reports should include relevant charts; for that, we need JavaScript to manipulate the notebook's DOM.

For that purpose, a notebook can use the Playground abstraction. It allows us to create data in Scala and use it in predefined JavaScript functions (located under assets/javascripts/notebook) or even JavaScript snippets (that is, written straight in the notebook as a Scala String to be sent to the JavaScript interpreter).

The JavaScript function will be called with these parameters:

Here is how this can be used, with a predefined consoleDir JS function (see here):

Simple Playground

Another example uses the same predefined function to react to new incoming data (more in a further section). The new thing here is the use of a Codec to convert a Scala object into the JSON format used in JS:

Playground with custom data

Plotting with D3

Plotting with D3.js is rather common now; however, it's not always simple, hence there is a Scala wrapper that bootstraps D3 into the mix.

These wrappers are D3.svg and D3.linePlot, and they are just a proof of concept for now. The idea is to bring Scala data to D3.js, then create CoffeeScript to interact with it.

For instance, linePlot is used like so:
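Since the snippet itself only survives here as a screenshot, a call might be sketched roughly as follows (the exact D3.linePlot signature is an assumption; treat names as illustrative):

```scala
// Illustrative only: feed (x, y) pairs computed in Scala to the linePlot wrapper,
// which bootstraps the D3.js rendering on the client side.
val points = (0 to 100).map(i => (i.toDouble, math.sin(i / 10.0)))
D3.linePlot(points)
```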

Using Rickshaw

Note: This is subject to future change because it would be better to use playground for this purpose.

Timeseries with Rickshaw

Plotting timeseries is very common; for this purpose the Spark Notebook includes Rickshaw, which quickly enables handsome timeline charts.

Rickshaw is available through the Playground, plus a dedicated function, rickshawts, for simple needs.

To use it, you are only required to convert/wrap your data points into a dedicated Series object:

def createTss(start: Long, step: Int = 60*1000, nb: Int = 100): Seq[Series] = ...
val data = createTss(orig, step, nb)

val p = new Playground(data, List(Script("rickshawts",
                                      Json.obj(
                                        "renderer" → "stack",
                                        "fixed" → Json.obj(
                                                    "interval" → (step/1000),
                                                    "max" → 100,
                                                    "baseInSec" → (orig/1000)
                                                  )
                                      ))))(seriesCodec)

As you can see, the only real work is to create the timeseries (a Seq[Series]), which is a simple wrapper around:

Also, there are some options to tune the display:

Here is an example of the kind of result you can expect:

Using Rickshaw

Dynamic update of data and plot using Scala's Future

One of the very cool things used in the original scala-notebook is the use of reactive libraries on both sides, server and client, combined with WebSockets. This offers a neat way to show dynamic activity such as streaming data.

We can exploit this reactive support to update Plot wrappers (the Playground instance, actually) in a dynamic manner. If the JS functions are listening for data changes, they automatically update their output.

The following example shows how a timeseries plotted with Rickshaw can be regularly updated. We use Scala Futures to simulate a server-side process that polls a third-party service:
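In outline, such a simulation might look like this (fetchFromService and the playground's update call are hypothetical placeholders, not real spark-notebook API):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Simulate a server-side poller: every second, fetch fresh points and
// push them to the reactive playground `p` so the chart redraws itself.
def poll(): Unit = Future {
  Thread.sleep(1000)
  val fresh = fetchFromService()  // hypothetical third-party call
  p.update(fresh)                 // hypothetical reactive update of the Playground
  poll()                          // schedule the next iteration
}
```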

Update Timeseries Result

The results will be:

Update Timeseries Result

Update Notebook ClassPath

Keeping your notebook runtime updated with the libraries you need in the classpath is usually cumbersome, as it requires updating the server configuration in the SBT definition and restarting the system: a rebuild and restart that isn't even contextual to the notebook!

Hence, a dedicated context has been added to the blocks, :cp, which allows us to add specific local paths to jars that will become part of the classpath.

:cp /home/noootsab/.m2/repository/joda-time/joda-time/2.4/joda-time-2.4.jar

Or even


:cp
/scala-notebook/repo/com/codahale/metrics/metrics-core/3.0.2/metrics-core-3.0.2.jar
/scala-notebook/repo/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
/scala-notebook/repo/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
/scala-notebook/repo/joda-time/joda-time/2.3/joda-time-2.3.jar
/scala-notebook/repo/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar
/scala-notebook/repo/com/datastax/cassandra/cassandra-driver-core/2.0.4/cassandra-driver-core-2.0.4.jar
/scala-notebook/repo/org/apache/thrift/libthrift/0.9.1/libthrift-0.9.1.jar
/scala-notebook/repo/org/apache/httpcomponents/httpcore/4.2.4/httpcore-4.2.4.jar
/scala-notebook/repo/org/joda/joda-convert/1.2/joda-convert-1.2.jar
/scala-notebook/repo/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
/scala-notebook/repo/org/apache/cassandra/cassandra-clientutil/2.0.9/cassandra-clientutil-2.0.9.jar
/scala-notebook/repo/org/slf4j/slf4j-api/1.7.2/slf4j-api-1.7.2.jar
/scala-notebook/repo/com/datastax/cassandra/cassandra-driver-core/2.0.4/cassandra-driver-core-2.0.4-sources.jar
/scala-notebook/repo/io/netty/netty/3.9.0.Final/netty-3.9.0.Final.jar
/scala-notebook/repo/org/apache/commons/commons-lang3/3.3.2/commons-lang3-3.3.2.jar
/scala-notebook/repo/commons-codec/commons-codec/1.6/commons-codec-1.6.jar
/scala-notebook/repo/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar
/scala-notebook/repo/org/apache/cassandra/cassandra-thrift/2.0.9/cassandra-thrift-2.0.9.jar
/scala-notebook/repo/com/datastax/spark/spark-cassandra-connector_2.10/1.1.0-alpha1/spark-cassandra-connector_2.10-1.1.0-alpha1.jar
/scala-notebook/repo/com/google/guava/guava/15.0/guava-15.0.jar

Here is what it'll look like in the notebook:

Simple Classpath

Update Spark dependencies (spark.jars)

So you use Spark, hence you know that it's not enough to have the jars added locally to the driver's classpath.

Indeed, the workers need to have them in their classpath too. One option would be to update the list of jars (the spark.jars property) provided to the SparkConf using the reset function.
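That first option might be sketched like so (the jar path is illustrative):

```scala
// Restart the context with an extra jar shipped to the workers via spark.jars.
val extraJar = "/path/to/extra-lib.jar"  // illustrative local path
reset(lastChanges = _.set("spark.jars", extraJar))
```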

However, this can be very tricky when we need to add jars that have themselves plenty of dependencies.

However, there is another context available that updates both the classpath of the notebook and Spark's jars property: :dp.


:dp group1 % artifact1 % version1
group2 % artifact2 % version2
group3 % artifact3 % version3
group4 % artifact4 % version4

group5 % artifact5 % version5
group6 % artifact6 % version6
group7 % artifact7 % version7

So this is simple:

The jars will be fetched into a temporary repository (which can be hardcoded using :local-repo).

Then they'll be added to Spark's jars property before the context is restarted.

For example, if you want to use ADAM, all you need to do is:

:dp org.bdgenomics.adam % adam-apis % 0.15.0
- org.apache.hadoop % hadoop-client %   _
- org.apache.spark  %     _         %   _
- org.scala-lang    %     _         %   _
- org.scoverage     %     _         %   _

You can check this live in the notebook named Update classpath and Spark's jars, which looks like this:

Spark Jars

IMPORTANT

Some visualizations (wisp) currently use Highcharts, which is not available for commercial or private usage!

If you're in this case, please contact me first.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.