IBM/graph-db-insights

Name: graph-db-insights

Owner: International Business Machines

Description: Get insights from OrientDB database using PyOrient through IBM Watson Studio

Created: 2017-08-16 14:04:29.0

Updated: 2018-03-22 17:31:29.0

Pushed: 2018-03-22 17:31:31.0

Homepage: https://developer.ibm.com/code/patterns/store-graph-and-derive-insights-from-interconnected-data

Size: 4456

Language: Jupyter Notebook

README

Get Insights from OrientDB database using PyOrient through IBM Watson Studio

Data Science Experience is now Watson Studio. Although some images in this code pattern may show the service as Data Science Experience, the steps and processes will still work.

This journey gives you a head start on working with graphs in OrientDB through IBM Watson Studio using the PyOrient module, a Python driver for OrientDB, to operate on data and to get insights from it. IBM Watson Studio can be used to analyze data using Jupyter notebooks.

OrientDB is a multi-model database that supports graph, document, key/value, and object models, but relationships are managed as in graph databases, with direct connections between records. Graph databases are well suited to analyzing interconnections, for example when mining data from social media. They are also useful for working with data in business disciplines that involve complex relationships and dynamic schemas, and for creating recommendations such as “customers who bought this also looked at…”. This journey will help you understand the end-to-end flow: downloading the data set, cleansing the data, extracting entities and relations from the data set, connecting to OrientDB, creating a new OrientDB database, populating the database with node classes, edge classes, vertices, and relations, and then executing queries to get insights from the data. OrientDB has extended SQL to support graph traversal, making it easy for developers familiar with SQL to start exploring graph databases for their business needs.
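As a rough illustration of the populate step described above, the sketch below builds OrientDB SQL statements for creating node classes, edge classes, vertices, and edges. The class and property names (`Person`, `Movie`, `Mentions`, `name`, `title`) and the helper functions are hypothetical and are not taken from the notebook; the statements follow OrientDB's `CREATE CLASS`, `CREATE VERTEX`, and `CREATE EDGE` syntax.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    """Reject class or property names that are not plain identifiers."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a valid identifier: {name!r}")
    return name


def quote(value):
    """Render a Python value as an OrientDB SQL literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def create_class(name, base):
    """Node classes extend V, edge classes extend E."""
    return f"CREATE CLASS {_ident(name)} EXTENDS {_ident(base)}"


def create_vertex(cls, props):
    """Build a CREATE VERTEX statement from a dict of properties."""
    fields = ", ".join(f"{_ident(k)} = {quote(v)}" for k, v in props.items())
    return f"CREATE VERTEX {_ident(cls)} SET {fields}"


def create_edge(cls, src, dst):
    """src and dst are (class, property, value) triples identifying a vertex."""
    def select(triple):
        vertex_cls, prop, value = triple
        return f"(SELECT FROM {_ident(vertex_cls)} WHERE {_ident(prop)} = {quote(value)})"
    return f"CREATE EDGE {_ident(cls)} FROM {select(src)} TO {select(dst)}"
```

Each returned string could then be passed to PyOrient's `client.command(...)` once a database is open. Building the statements separately from executing them keeps the schema logic easy to inspect before anything is written to the database.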

In this journey we will demonstrate:

To achieve this, an OrientDB instance is created on a Kubernetes cluster and then accessed through IBM Watson Studio. This journey will help developers get started with various OrientDB operations such as CRUD, basic traversal, and extracting insights using PyOrient on IBM Watson Studio.

When the reader has completed this journey, they will understand how to:

  1. The developer sets up a Kubernetes cluster using the Kubernetes service on IBM Cloud.
  2. The OrientDB instance is deployed with a persistent volume on the Kubernetes cluster created in the first step, exposing the ports (2424, 2480) used by OrientDB on IBM Cloud (formerly Bluemix).
  3. The developer creates a Jupyter notebook on IBM Watson Studio, powered by Spark. While the notebook is being created, an Object Storage instance is attached to it for storing the data the notebook uses.
  4. The developer uploads the configuration file (config.json) and the dataset (graph-insights.csv) to Object Storage.
  5. The Object Storage credentials are updated in the notebook, and the files are loaded from Object Storage to create a graph from them in OrientDB.
  6. The notebook communicates with OrientDB through the PyOrient driver, and various operations are performed on OrientDB using functions written in the Jupyter notebook.
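The file-loading part of this flow (steps 4 and 5) can be sketched with the standard library alone. This is a minimal sketch that assumes the contents of config.json and graph-insights.csv have already been fetched from Object Storage as text; the column names in the usage note are made up, and the real layout of both files is not described here.

```python
import csv
import io
import json


def load_config(text):
    """Parse the text of config.json into a dict."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("config.json is expected to hold a JSON object")
    return config


def load_dataset(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def distinct_values(rows, column):
    """Collect the distinct non-empty values of one column, in first-seen order.

    Each distinct value is a candidate vertex when the graph is built.
    """
    seen = []
    for row in rows:
        value = row[column].strip()
        if value and value not in seen:
            seen.append(value)
    return seen
```

For example, with a hypothetical two-column dataset, `distinct_values(load_dataset("person,movie\nann,Movie A\nbob,Movie A\n"), "movie")` yields one candidate vertex, `Movie A`, while each row would correspond to one edge.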
Included components
Featured technologies
Prerequisite

Create a Kubernetes cluster with IBM Cloud Container Service to deploy in the cloud. Then deploy OrientDB on the Kubernetes cluster by following Deploy OrientDB on Kubernetes.

Watch the Video

Watch this video to get an overview of this developer Journey.

Steps

Follow these steps to set up and run this developer journey. The steps are described in detail below.

  1. Deploy OrientDB on Kubernetes Cluster
  2. Sign up for Watson Studio
  3. Create the notebook
  4. Add the data
  5. Update the notebook with service credentials
  6. Flow of the notebook
  7. Run the notebook
  8. Analyze the results
1. Deploy OrientDB on Kubernetes Cluster

Deploy OrientDB on the Kubernetes cluster by following Deploy OrientDB on Kubernetes. This exposes the ports on IBM Cloud through which OrientDB can be accessed from the Jupyter notebook on IBM Watson Studio. To access OrientDB from the notebook, use the IP address of your cluster and the node port mapped to port 2424, which the OrientDB console uses.
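Before running the notebook, it can help to confirm that the node port is reachable and then open a PyOrient session against it. The sketch below is illustrative only: the function names are hypothetical, the host, port, and credentials must come from your own deployment, and it assumes the same credentials are valid for both the server and the database. PyOrient is imported lazily so the reachability check works even where the driver is not installed.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def open_graph_database(host, port, user, password, db_name):
    """Connect to OrientDB and open db_name, creating it as a graph database if absent.

    Assumption: user/password are accepted both by the server and by the database.
    """
    import pyorient  # third-party driver, imported only when a connection is needed

    client = pyorient.OrientDB(host, port)
    client.connect(user, password)
    if not client.db_exists(db_name, pyorient.STORAGE_TYPE_PLOCAL):
        client.db_create(db_name, pyorient.DB_TYPE_GRAPH, pyorient.STORAGE_TYPE_PLOCAL)
    client.db_open(db_name, user, password)
    return client
```

If `port_is_open` returns False for your cluster's IP address and node port, the service is probably not exposed yet, and the notebook's connection cell will fail for the same reason.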

2. Sign up for Watson Studio

Sign up for IBM's Watson Studio. Creating a project in Watson Studio also creates a free-tier Object Storage service in your IBM Cloud account.

3. Create the notebook
3.1. Additional notes for the notebook.
4. Add the data
Add the data to the notebook

5. Update the notebook with service credentials
Add the Object Storage credentials to the notebook

6. Flow of the notebook

The notebook is divided into sections, each performing a specific task on OrientDB.

7. Run the notebook

When a notebook is executed, each code cell in the notebook is executed in order, from top to bottom.

Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:

There are several ways to execute the code cells in your notebook:

For this notebook, running the cells one by one is recommended, both to understand the flow of the notebook and to better follow the operation each cell performs on OrientDB.

8. Analyze the results

The notebook uses two use cases to demonstrate how to get insights from OrientDB: finding the most mentioned movie and clustering the movies with an IMDb rating greater than 7. Each insight has its own function in the notebook, which you can find in the Core Functions cell. Call those functions to get the results. The following image shows the functions and their results.
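To make the two insights concrete, here is a plain-Python sketch of what they compute, together with OrientDB SQL strings that would express something similar. Everything named here is an assumption: the `Movie` class, the `Mentions` edge class, and the `title` and `imdb_rating` properties are hypothetical, the queries have not been run against the notebook's schema, and "clustering" is read simply as selecting the movies whose rating exceeds the threshold.

```python
from collections import Counter

# Hypothetical queries; class, edge, and property names are assumptions.
MOST_MENTIONED_SQL = (
    "SELECT title, in('Mentions').size() AS mentions "
    "FROM Movie ORDER BY mentions DESC LIMIT 1"
)
HIGH_RATED_SQL = "SELECT title, imdb_rating FROM Movie WHERE imdb_rating > 7"


def most_mentioned(mentions):
    """Given (person, movie) pairs, return (movie, count) for the most mentioned movie."""
    counts = Counter(movie for _person, movie in mentions)
    return counts.most_common(1)[0] if counts else None


def rated_above(ratings, threshold=7.0):
    """Given {title: rating}, return the titles rated strictly above the threshold."""
    return sorted(title for title, rating in ratings.items() if rating > threshold)
```

With made-up data, `most_mentioned([("ann", "Movie A"), ("bob", "Movie A"), ("cy", "Movie B")])` picks `Movie A` with two mentions, and `rated_above({"Movie A": 8.1, "Movie B": 6.5})` keeps only `Movie A`.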

OrientDB also provides an interactive dashboard, OrientDB Studio, for visualizing the graph and viewing query results. You can run queries in the Browse section of OrientDB Studio to get the desired insights or to create nodes and edges. The same two queries the notebook uses (the most mentioned movie, and the clustering of movies with an IMDb rating greater than 7) can be executed in the Browse section to analyze the results; see the OrientDB Studio screenshots below. Query results are available as a table and as JSON, and can also be downloaded as CSV for further analysis.

* Run the query to cluster the movies with an IMDb rating greater than 7 and view the results in table format

* Run both queries to get the most_mentioned movie and view the results in table format

* Run the query for most_mentioned and view the results in JSON format
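The same table-to-CSV export can be done from the notebook side. The sketch below is a minimal example that assumes each query result has already been turned into a plain dict of field names to values; with PyOrient this would typically mean reading each record's `oRecordData`, which is stated here as an assumption rather than verified against the notebook.

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of dicts to CSV text, one row per record.

    The header is the union of all keys in first-seen order; fields missing
    from a record are written as empty cells.
    """
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

The returned text can be written to a file or uploaded to Object Storage for further analysis.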

To visualize the graph created by using the functions written in the notebook,


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.