Intel-bigdata/TensorFlowOnYARN

Name: TensorFlowOnYARN

Owner: Intel-bigdata

Description: Support TensorFlow on YARN

Created: 2017-03-13 06:02:01.0

Updated: 2018-05-23 08:24:20.0

Pushed: 2017-06-19 07:27:04.0

Homepage:

Size: 141

Language: Java

README

TensorFlowOnYARN

TensorFlow on YARN (TOY) is a toolkit that gives Hadoop users an easy way to run TensorFlow applications in a distributed fashion and to accomplish tasks including model management and serving inference.

Goals

Note that the current project is a prototype with limitations and is still under development.

Architecture

Figure1. TOY Architecture

Features

Quick Start
  1. Prepare the build environment following the instructions from https://www.tensorflow.org/install/install_sources

  2. Clone the TensorFlowOnYARN repository.

    git clone --recursive https://github.com/Intel-bigdata/TensorFlowOnYARN
    
  3. Build the assembly.

    cd TensorFlowOnYARN/tensorflow-parent
    mvn package -Pnative -Pdist
    

    tensorflow-yarn-${VERSION}.tar.gz and tensorflow-yarn-${VERSION}.zip are built in the tensorflow-parent/tensorflow-yarn-dist/target directory. Distribute the assembly to the client node of a YARN cluster and extract it.

  4. Run the between-graph mnist example.

    cd tensorflow-yarn-${VERSION}
    ydl-tf launch --num_worker 2 --num_ps 2
    

    This launches a YARN application that creates a tf.train.Server instance for each task. A ClusterSpec is printed to the console, and you can then submit the training script to that cluster, e.g.

    ClusterSpec: {"ps":["node1:22257","node2:22222"],"worker":["node3:22253","node2:22255"]}
    
    python examples/between-graph/mnist_feed.py \
      --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
      --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
      --task_index=0
    
    python examples/between-graph/mnist_feed.py \
      --ps_hosts="ps0.hostname:ps0.port,ps1.hostname:ps1.port" \
      --worker_hosts="worker0.hostname:worker0.port,worker1.hostname:worker1.port" \
      --task_index=1
    
  5. Get the ClusterSpec of an existing TensorFlow cluster launched by a previous YARN application.

    ydl-tf cluster --app_id <Application ID>
    
  6. You may also use YARN commands through ydl-tf.

    For example, to list the running applications,

    ydl-tf application --list
    

    or to kill an existing YARN application (TensorFlow cluster),

    ydl-tf kill --application <Application ID>
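
The ClusterSpec printed in step 4 has to be turned into the comma-separated host lists that the example script takes on its command line. The following is a minimal sketch of that conversion, assuming the console line has the `ClusterSpec: {...}` JSON form shown above and that the script accepts only the three flags used in step 4. The helper names `parse_cluster_spec` and `worker_commands` are hypothetical and are not part of TensorFlowOnYARN.

```python
import json


def parse_cluster_spec(line):
    """Parse a console line like 'ClusterSpec: {...}' into {job name: [host:port, ...]}.

    Assumption: the text after the 'ClusterSpec:' prefix is valid JSON,
    as in the sample output shown in step 4.
    """
    _, _, payload = line.partition("ClusterSpec:")
    # Fall back to the whole line when the prefix is absent (bare JSON).
    spec = json.loads(payload if payload else line)
    return {job: list(hosts) for job, hosts in spec.items()}


def worker_commands(spec, script="examples/between-graph/mnist_feed.py"):
    """Build one command line per worker task, mirroring the commands in step 4."""
    ps_hosts = ",".join(spec["ps"])
    worker_hosts = ",".join(spec["worker"])
    return [
        f'python {script} --ps_hosts="{ps_hosts}" '
        f'--worker_hosts="{worker_hosts}" --task_index={index}'
        for index in range(len(spec["worker"]))
    ]


if __name__ == "__main__":
    sample = ('ClusterSpec: {"ps":["node1:22257","node2:22222"],'
              '"worker":["node3:22253","node2:22255"]}')
    for command in worker_commands(parse_cluster_spec(sample)):
        print(command)
```

Running the module prints one command per worker, which can be pasted into a shell on a node that can reach the listed hosts. Keeping the parsing and the command construction in separate functions makes it straightforward to feed in the output of `ydl-tf cluster --app_id <Application ID>` instead, provided that command prints the ClusterSpec in the same form.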
    
