Intel-bigdata/MXNetOnYARN

Name: MXNetOnYARN

Owner: Intel-bigdata

Description: Support running MXNet on YARN

Created: 2017-03-15 03:01:53

Updated: 2018-04-29 15:24:24

Pushed: 2017-04-07 03:43:50

Size: 240 KB

Language: Java

README

MXNetOnYARN

MXNet on YARN is a project based on dmlc-core and MXNet that aims to run MXNet on YARN with high efficiency and flexibility. It is an important part of Deep Learning on Hadoop (HDL).

Note that both the codebase and documentation are work in progress. They may not be the final version.

Overview

Performing large-scale training or prediction tasks with high efficiency has always been a major challenge in machine learning and deep learning, because in most cases the training datasets, computation graphs, variables, etc. are too large for a single node. To address this issue, distributed support on existing cluster environments, e.g., Hadoop YARN, is very important for current deep learning frameworks.

In this project, we propose MXNetOnYARN to enable distributed MXNet training and serving on Hadoop YARN, a resource manager that has been used successfully to run all sorts of distributed data applications.

Figure 1. Basic Design

With the help of MXNetOnYARN, users can submit multiple MXNet training or serving tasks to an existing YARN cluster without modifying their jobs or worrying about environment setup, dependencies, etc. MXNetOnYARN handles all the details of distributed machine learning with high efficiency and flexibility. When the tasks finish, all resources are released, and the results and logs are saved to HDFS.

Get Started

This note describes how to deploy and run training on YARN.

The basic command is as follows:

ydl-mx [cluster options] [task options]

Users can use this command to submit training or serving tasks with specific parameters or datasets. Note that

cluster options specify the distributed environment, e.g., the number of workers and servers to launch. Normally, users only need to specify the number of workers, and MXNetOnYARN will launch the same number of servers. So --n 2 launches 2 workers and 2 servers, while --n 2 --s 1 launches 2 workers and 1 server.

task options specify the detailed configuration of the machine learning task, e.g., job.py --kv-store sync --data-dir for distributed training. The basic format is the same as for a single-node MXNet task, minus the python prefix, so all MXNet parameters are supported and can be added to the task options.
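For example, a job that runs on a single node as python train_mnist.py --kv-store sync --data-dir . is expressed as the following task options when submitted through MXNetOnYARN:

train_mnist.py --kv-store sync --data-dir .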

Preparation

1. Distribution Support in your MXNet jobs

In your code, modify the demos to add distribution support (e.g., train_mnist.py), or create a kvstore and explicitly set it in your model:

kv = mx.kvstore.create('dist_sync')  # synchronous distributed kvstore
model = mx.model.FeedForward.create(symbol = net, X = data, kvstore = kv, ...)  # pass the kvstore to the model

The use of the parameter server is based on the kvstore class in MXNet.
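As a slightly fuller sketch, each worker can also use the kvstore's rank to read its own shard of the data. This follows the classic MXNet symbolic API used by train_mnist.py; the MNIST file names, network shape, and hyperparameters here are illustrative, not prescribed by MXNetOnYARN:

import mxnet as mx
# Synchronous distributed kvstore; MXNetOnYARN launches the worker and
# server processes that back it.
kv = mx.kvstore.create('dist_sync')
# Shard the training data so each worker reads a distinct partition.
train = mx.io.MNISTIter(
    image='train-images-idx3-ubyte',
    label='train-labels-idx1-ubyte',
    batch_size=128,
    num_parts=kv.num_workers,  # total number of workers
    part_index=kv.rank)        # this worker's shard
# A small multilayer perceptron as the computation graph.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data=data, num_hidden=128)
net = mx.sym.Activation(data=net, act_type='relu')
net = mx.sym.FullyConnected(data=net, num_hidden=10)
net = mx.sym.SoftmaxOutput(data=net, name='softmax')
# Passing the kvstore switches training to parameter-server mode.
model = mx.model.FeedForward.create(symbol=net, X=train, num_epoch=10, kvstore=kv)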

2. Prepare ydl-mx.jar

Download and extract ydl-mx.jar from our release package and put it in your working directory. Note that libmxnet-scala.so is packaged into ydl-mx.jar, so if it does not work on your platform, please build your own ydl-mx.jar according to mxnet-parent/mxnet-yarn-app/build.md.

How to Run

For example, we can submit the application like this:

/bin/ydl-mx --n 2 --jobname MXNetOnYarn --jar ydl-mx.jar train_mnist.py --kv-store sync --data-dir .

cluster options:

`--jobname` specifies the name of the job.
`--n` specifies the number of workers and servers.
`--jar` specifies the path to `ydl-mx.jar`.

task options:

train_mnist.py --kv-store sync --data-dir .

This part of the command executes the train_mnist.py example with distributed support, i.e., --kv-store sync.
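Under the hood, launchers built on dmlc-core pass the cluster layout to each launched process through DMLC_* environment variables, which the distributed kvstore reads when it is created. A minimal sketch for inspecting them from inside a job (the exact variable set may differ across dmlc-core versions):

import os
# Role assigned by the launcher to this process:
# 'worker', 'server', or 'scheduler'.
print('role:', os.environ.get('DMLC_ROLE'))
# Cluster layout as seen by the parameter-server runtime.
print('workers:', os.environ.get('DMLC_NUM_WORKER'))
print('servers:', os.environ.get('DMLC_NUM_SERVER'))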

Serving (Inference)