NVIDIA/gpu-rest-engine

Name: gpu-rest-engine

Owner: NVIDIA Corporation

Description: A REST API for Caffe using Docker and Go

Created: 2016-02-26 19:49:43.0

Updated: 2018-01-17 04:53:55.0

Pushed: 2017-12-26 21:59:16.0

Homepage:

Size: 265

Language: C++

README

Introduction

This repository shows how to implement a REST server for low-latency image classification (inference) using NVIDIA GPUs. This is an initial demonstration of the GRE (GPU REST Engine) software that will allow you to build your own accelerated microservices.

This demonstration builds on several technologies with which you may be familiar, including Docker, Go, Caffe, and TensorRT.

Building

Prerequisites

Build command (Caffe)

The command might take a while to execute:

$ docker build -t inference_server -f Dockerfile.caffe_server .

To speed up the build, you can modify the Dockerfile so that it builds only for the GPU architecture that you need.

Build command (TensorRT)

This command requires the TensorRT archive to be present in the current folder.

$ docker build -t inference_server -f Dockerfile.tensorrt_server .

Testing

Starting the server

Execute the following command and wait a few seconds for the initialization of the classifiers:

$ docker run --runtime=nvidia --name=server --net=host --rm inference_server

You can use the environment variable NVIDIA_VISIBLE_DEVICES to isolate GPUs for this container.

Single image

Since we used --net=host, we can access our inference server from a terminal on the host using curl:

$ curl -XPOST --data-binary @images/1.jpg http://127.0.0.1:8000/api/classify
[{"confidence":0.9998,"label":"n02328150 Angora, Angora rabbit"},{"confidence":0.0001,"label":"n02325366 wood rabbit, cottontail, cottontail rabbit"},{"confidence":0.0001,"label":"n02326432 hare"},{"confidence":0.0000,"label":"n02085936 Maltese dog, Maltese terrier, Maltese"},{"confidence":0.0000,"label":"n02342885 hamster"}]
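The same request can be issued from Go. The following is a minimal sketch, not code from the repository: the Prediction type and the parsePredictions and classify helpers are hypothetical names chosen for illustration, the JSON field names are taken from the sample response above, and the Content-Type header is an assumption (the sketch assumes the server only reads the raw request body).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// Prediction mirrors one element of the JSON array shown in the sample
// response (hypothetical type, named for this example only).
type Prediction struct {
	Confidence float64 `json:"confidence"`
	Label      string  `json:"label"`
}

// parsePredictions decodes the JSON array returned by /api/classify.
func parsePredictions(data []byte) ([]Prediction, error) {
	var preds []Prediction
	if err := json.Unmarshal(data, &preds); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	return preds, nil
}

// classify POSTs the raw image bytes and returns the decoded predictions.
// Assumption: the server ignores the Content-Type header.
func classify(url string, image []byte) ([]Prediction, error) {
	resp, err := http.Post(url, "application/octet-stream", bytes.NewReader(image))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("server returned %s: %s", resp.Status, body)
	}
	return parsePredictions(body)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Printf("usage: %s IMAGE\n", os.Args[0])
		return
	}
	image, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read image:", err)
		return
	}
	preds, err := classify("http://127.0.0.1:8000/api/classify", image)
	if err != nil {
		fmt.Println("classify:", err)
		return
	}
	for _, p := range preds {
		fmt.Printf("%.4f  %s\n", p.Confidence, p.Label)
	}
}
```

With the server running, the program can be started with go run and the path to an image such as images/1.jpg as its only argument.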

Benchmarking performance

We can benchmark the performance of our classification server using any tool that can generate HTTP load. We included a Dockerfile for a benchmarking client using rakyll/hey:

$ docker build -t inference_client -f Dockerfile.inference_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=20000 --net=host inference_client

If you have Go installed on your host, you can also benchmark the server with a client outside of a Docker container:

$ go get github.com/rakyll/hey
$ hey -n 200000 -m POST -D images/2.jpg http://127.0.0.1:8000/api/classify

Performance on an NVIDIA DIGITS DevBox

This machine has 4 GeForce GTX Titan X GPUs:

$ hey -c 8 -n 200000 -m POST -D images/2.jpg http://127.0.0.1:8000/api/classify
Summary:
  Total:        100.7775 secs
  Slowest:      0.0167 secs
  Fastest:      0.0028 secs
  Average:      0.0040 secs
  Requests/sec: 1984.5690
  Total data:   68800000 bytes
  Size/request: 344 bytes
[...]

As a comparison, Caffe in standalone mode achieves approximately 500 images per second on a single Titan X for inference (batch=1). At roughly 1985 requests per second across 4 GPUs, the server comes close to four times the single-GPU rate, which indicates near-optimal GPU utilization and good multi-GPU scaling even with a REST API on top. A discussion of GPU performance for inference at different batch sizes can be found in our GPU-Based Deep Learning Inference whitepaper.

This inference server is aimed at low-latency applications. To achieve higher throughput, we would need to batch multiple incoming client requests or have clients send multiple images to classify. Batching can be added easily when using the C++ API of Caffe. An example of this strategy, called “Batch Dispatch”, can be found in this article from Baidu Research.

Benchmarking overhead of CUDA kernel calls

As with the inference server, a simple server is provided for estimating the overhead of launching CUDA kernels from your code. The server simply calls an empty CUDA kernel before responding with HTTP 200 to the client. It can be built and started using the same commands as above:

$ docker build -t benchmark_server -f Dockerfile.benchmark_server .
$ docker run --runtime=nvidia --name=server --net=host --rm benchmark_server

And for the client:

$ docker build -t benchmark_client -f Dockerfile.benchmark_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=200000 --net=host benchmark_client
[...]
Summary:
  Total:        5.8071 secs
  Slowest:      0.0127 secs
  Fastest:      0.0001 secs
  Average:      0.0002 secs
  Requests/sec: 34440.3083

Contributing

Feel free to report issues during build or execution. We also welcome suggestions to improve the performance of this application.


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.