Intel-bigdata/HDL

Name: HDL

Owner: Intel-bigdata

Description: Support Deep Learning on Hadoop platform

Created: 2017-03-13 05:57:55.0

Updated: 2018-03-27 09:20:40.0

Pushed: 2017-05-31 02:20:44.0

Homepage: null

Size: 52007

Language: null

README

HDL - Deep Learning on Hadoop

Supports deep learning on the Hadoop platform by leveraging existing popular deep learning engines such as TensorFlow, MXNet, Caffe, and Intel Caffe. See HADOOP-13944.

This project focuses on the overall architecture, common facilities, and high-level considerations. Each engine has its own subproject: TensorflowOnYARN, CaffeOnYARN, and MXNetOnYARN.

High level considerations

Architecture

Why on Hadoop - the rationale

How to Run

1. TensorFlow

Assume you are in the TensorflowOnYARN directory.

  1. Prepare the build environment following the instructions in the TensorFlow tutorial.

  2. Run the between-graph MNIST example.

Method One:

Request resources (ClusterSpec) and run.

ydl-tf launch --num_worker 2 --num_ps 2

This launches a YARN application that creates a tf.train.Server instance for each task. The resulting ClusterSpec is printed to the console so that you can run the training script against it, once per worker task, e.g.

ClusterSpec: {"ps":["node1:22257","node2:22222"],"worker":["node3:22253","node2:22255"]}

python examples/between-graph/mnist_feed.py \
--ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
--worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
--task_index=0

python examples/between-graph/mnist_feed.py \
--ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
--worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
--task_index=1
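The mapping from the printed ClusterSpec to the per-worker commands above can be sketched in Python. This is only an illustration: the helper name `worker_commands` is hypothetical, and it assumes the ClusterSpec is printed as the JSON object shown in the sample line. It uses only the script path and flags that appear in this README.

```python
import json

def worker_commands(cluster_spec_json, script="examples/between-graph/mnist_feed.py"):
    """Build one argv list per worker task from a ClusterSpec JSON string.

    Hypothetical helper: it only reuses the flags shown in this README
    (--ps_hosts, --worker_hosts, --task_index).
    """
    spec = json.loads(cluster_spec_json)
    ps_hosts = ",".join(spec["ps"])
    worker_hosts = ",".join(spec["worker"])
    commands = []
    for task_index in range(len(spec["worker"])):
        commands.append([
            "python", script,
            "--ps_hosts=" + ps_hosts,
            "--worker_hosts=" + worker_hosts,
            "--task_index=%d" % task_index,
        ])
    return commands

# The JSON part of the sample ClusterSpec line from this README.
spec_line = '{"ps":["node1:22257","node2:22222"],"worker":["node3:22253","node2:22255"]}'
for argv in worker_commands(spec_line):
    print(" ".join(argv))
```

Each returned list can be passed to `subprocess.run` or joined into a shell command, one per worker task.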

Method Two:

Directly submit TensorFlow training jobs and parameters to YARN.

python bin/demo.py "bin/ydl-tf" "launch" "examples/between-graph/mnist_feed.py"

  3. Get the ClusterSpec of an existing TensorFlow cluster launched by a previous YARN application.

ydl-tf cluster --app_id <Application ID>

  4. You can also use YARN commands through ydl-tf.

For example, list the running applications:

ydl-tf application --list

or kill an existing YARN application (TensorFlow cluster):

ydl-tf kill --application <Application ID>
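If these ydl-tf invocations need to be scripted, one possible approach is a thin Python wrapper that builds the argument lists. The sketch below is an assumption-laden illustration, not part of the project: the function names are hypothetical, `ydl-tf` is assumed to be on the PATH, and only the subcommands and flags shown in this README are used. By default it prints the command instead of running it.

```python
import subprocess

YDL_TF = "ydl-tf"  # assumption: the ydl-tf script is on PATH

def launch_args(num_worker, num_ps):
    # Mirrors: ydl-tf launch --num_worker 2 --num_ps 2
    return [YDL_TF, "launch", "--num_worker", str(num_worker), "--num_ps", str(num_ps)]

def cluster_args(app_id):
    # Mirrors: ydl-tf cluster --app_id <Application ID>
    return [YDL_TF, "cluster", "--app_id", app_id]

def list_args():
    # Mirrors: ydl-tf application --list
    return [YDL_TF, "application", "--list"]

def kill_args(app_id):
    # Mirrors: ydl-tf kill --application <Application ID>
    return [YDL_TF, "kill", "--application", app_id]

def run(argv, dry_run=True):
    """Print the command when dry_run is set; otherwise execute it."""
    if dry_run:
        print(" ".join(argv))
        return None
    return subprocess.run(argv, check=True)

run(launch_args(2, 2))
run(list_args())
```

Passing `dry_run=False` would execute the command on a machine where ydl-tf and a YARN cluster are available.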
2. Caffe

Assume you are in the CaffeOnYARN directory.

  1. Train MNIST with the jar package, solver prototxt, and parameters. The -num value is the number of services to launch.

ydl-caffe -jar ydl-caffe.jar -conf /path/lenet_memory_solver.prototxt -model hdfs:///mnist.model -num 3

  2. Check the log using the application ID printed to the console:

yarn logs -applicationId xxxxxxxxxx | less

or kill an existing YARN application:

yarn application -kill <Application ID>

3. MXNet

Assume you are in the MXNetOnYARN directory.

  1. Train MNIST in distributed mode.

ydl-mx 2 train_mnist.py --kv-store sync

  2. Check the log using the application ID printed to the console:

yarn logs -applicationId xxxxxxxxxxxx | less

or kill an existing YARN application:

yarn application -kill <Application ID>
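Both the Caffe and MXNet steps rely on copying the application ID from the console. A minimal Python sketch of automating that step follows. YARN application IDs have the form `application_<timestamp>_<sequence>`; the sample console line and the helper names are hypothetical, since the exact text printed by ydl-caffe and ydl-mx is not shown here.

```python
import re

# YARN application IDs look like application_<cluster timestamp>_<sequence number>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")

def find_application_id(console_text):
    """Return the first YARN application ID found in the text, or None."""
    match = APP_ID_PATTERN.search(console_text)
    return match.group(0) if match else None

def log_command(app_id):
    # Mirrors: yarn logs -applicationId <Application ID>
    return ["yarn", "logs", "-applicationId", app_id]

def kill_command(app_id):
    # Mirrors: yarn application -kill <Application ID>
    return ["yarn", "application", "-kill", app_id]

# Hypothetical console line; the real submission output may differ.
sample = "Submitted application application_1496197244000_0007"
app_id = find_application_id(sample)
print(" ".join(log_command(app_id)))
print(" ".join(kill_command(app_id)))
```

The returned lists can be handed to `subprocess.run` on a node where the `yarn` command is available.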
