NVIDIA/torch-cunn

Name: torch-cunn

Owner: NVIDIA Corporation

Description: (none)

Forked from: torch/cunn

Created: 2016-08-15 23:07:12

Updated: 2016-08-15 23:07:27

Pushed: 2017-01-19 01:02:20

Homepage: (none)

Size: 1273

Language: Cuda


README

CUDA backend for the Neural Network Package

This package provides a CUDA implementation for many of the modules in the base neural network package, nn.

To use

Simply convert your network model to CUDA by calling :cuda():

local model = nn.Sequential()
model:add(nn.Linear(2,2))
model:add(nn.LogSoftMax())

model:cuda()  -- convert model to CUDA
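To confirm the conversion, you can inspect the type of the module parameters; after `model:cuda()` they should be `CudaTensor`s. A quick sanity check, assuming the `model` built above:

-- check the first layer's weights: they now live on the GPU
print(torch.type(model:get(1).weight))  -- 'torch.CudaTensor'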

… and similarly for your tensors:

local input = torch.Tensor(32,2):uniform()
input = input:cuda()
local output = model:forward(input)

… or create them directly as CudaTensors:

local input = torch.CudaTensor(32,2):uniform()
local output = model:forward(input)
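Putting the pieces together, a minimal end-to-end sketch might look like the following; it assumes `nn`, `cutorch` and `cunn` are installed, and reuses the layer and batch sizes from the snippets above:

require 'nn'
require 'cunn'  -- pulls in cutorch as well

-- build a small model and move it to the GPU
local model = nn.Sequential()
model:add(nn.Linear(2,2))
model:add(nn.LogSoftMax())
model:cuda()

-- create a batch directly on the GPU and run a forward pass
local input = torch.CudaTensor(32,2):uniform()
local output = model:forward(input)
print(output:size())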
To run unit-tests

luajit -l cunn -e 'cunn.test()'

GPU Training Concepts

Performance

Allocating GPU memory is relatively expensive, so avoid creating new `CudaTensor`s inside your training loop. For example, do not do this:

require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  local b = torch.add(a, 1)
end

… this will allocate one thousand new `CudaTensor`s, one for each call to `torch.add(a, 1)`.

Use instead this form:

require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
local b = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  b:add(a, 1)
end

In this form, `b` is allocated only once, before the loop. Then the `b:add(a, 1)` operation performs the
add inside the GPU kernel and stores the result into the original `b` `CudaTensor`. This
will run noticeably faster in general. It is also much less likely to eat up arbitrary amounts of memory,
and less likely to need frequent calls to `collectgarbage(); collectgarbage()`.
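The same principle applies when feeding CPU data to the GPU: rather than allocating a fresh GPU tensor every iteration, you can pre-allocate one GPU buffer and `:copy()` each mini-batch into it. A rough sketch (the buffer and batch sizes here are illustrative, not from the original examples):

require 'cutorch'

-- pre-allocate the GPU buffer once, outside the loop
local gpuBatch = torch.CudaTensor(32, 2)

for it = 1, 1000 do
  local cpuBatch = torch.FloatTensor(32, 2):uniform()  -- stand-in for real data
  gpuBatch:copy(cpuBatch)  -- reuses the same GPU memory every iteration
  -- ... do work on gpuBatch here ...
end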

Benchmarking

GPU operations typically continue running asynchronously after an instruction has been issued.
For example, if you do:

require 'cutorch'

local a = torch.CudaTensor(1000,1000):uniform()
a:add(1)

… the GPU kernel to add 1 will only be scheduled for launch by `a:add(1)`. It might not have completed yet, or
even have reached the GPU, at the time that the `a:add(1)` instruction has completed.
Therefore, for running wall-clock timings, you should call `cutorch.synchronize()` before each timecheck
point:

require 'cutorch'
require 'sys'

local a = torch.CudaTensor(1000,1000):uniform()
cutorch.synchronize()
start = sys.tic()
a:add(1)
cutorch.synchronize()
print(sys.toc())
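If you time GPU code often, it can be convenient to wrap this pattern in a small helper. The function below is a sketch (`timeGpu` is a hypothetical name, not part of cutorch) that synchronizes before and after the timed work:

require 'cutorch'
require 'sys'

-- hypothetical helper: times fn() with proper GPU synchronization
local function timeGpu(fn)
  cutorch.synchronize()  -- make sure earlier kernels have finished
  sys.tic()
  fn()
  cutorch.synchronize()  -- wait for the timed kernels to finish
  return sys.toc()
end

local a = torch.CudaTensor(1000,1000):uniform()
print(timeGpu(function() a:add(1) end))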

