Name: cunn
Owner: Torch
Description: null
Created: 2013-10-18 12:30:34.0
Updated: 2018-01-11 11:59:00.0
Pushed: 2017-11-21 11:01:22.0
Homepage: null
Size: 1501
Language: Cuda
This package provides a CUDA implementation for many of the modules in the base nn package.
To install from source:

```bash
git clone https://github.com/torch/cunn
cd cunn
luarocks make rocks/cunn-scm-1.rockspec
```
Simply convert your network model to CUDA by calling `:cuda()`:

```lua
local model = nn.Sequential()
model:add(nn.Linear(2,2))
model:add(nn.LogSoftMax())
model:cuda() -- convert model to CUDA
```
… and similarly for your tensors:

```lua
local input = torch.Tensor(32,2):uniform()
input = input:cuda()
local output = model:forward(input)
```
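If you need a result back on the CPU (for example, to inspect or save it with standard tensors), you can convert it with `:float()`. A minimal sketch, assuming the `model` and `input` from the example above:

```lua
-- Copy the GPU result back to host memory as a FloatTensor.
-- `output` is assumed to be the CudaTensor produced by model:forward(input).
local cpuOutput = output:float()  -- allocates a FloatTensor and copies device -> host
print(cpuOutput:size())
```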
… or create them directly as `CudaTensor`s:

```lua
local input = torch.CudaTensor(32,2):uniform()
local output = model:forward(input)
```
To run the unit tests:

```bash
luajit -l cunn -e 'cunn.test()'
```
Performance
For best performance, allocate all your `CudaTensor`s once, at the start of the program, and then simply copy data backwards and forwards between main memory and your existing `CudaTensor`s. For example, avoid code like this:
```lua
require 'cutorch'
local a = torch.CudaTensor(1000):uniform()
for it = 1, 1000 do
   local b = torch.add(a, 1)
end
```
This will allocate one thousand new `CudaTensor`s, one for each call to `torch.add(a, 1)`. Use instead this form:
```lua
require 'cutorch'
local a = torch.CudaTensor(1000):uniform()
local b = torch.CudaTensor(1000):uniform()
for it = 1, 1000 do
   b:add(a, 1)
end
```
In this form, `b` is allocated only once, before the loop. Then the `b:add(a, 1)` operation will perform the add inside the GPU kernel, and store the result into the original `b` `CudaTensor`. This will run noticeably faster, in general. It's also a lot less likely to eat up arbitrary amounts of memory, and less likely to need frequent calls to `collectgarbage(); collectgarbage()`.
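When you do end up allocating tensors inside a loop, one way to keep memory in check is to invoke the garbage collector periodically so Lua releases the GPU memory held by unreferenced `CudaTensor`s. A sketch of that pattern (the every-100-iterations interval is an arbitrary choice, not from the original text):

```lua
require 'cutorch'
local a = torch.CudaTensor(1000):uniform()
for it = 1, 1000 do
   local b = torch.add(a, 1)  -- allocates a new CudaTensor each iteration
   if it % 100 == 0 then
      -- two passes: Lua's collector needs a mark and a sweep to free userdata
      collectgarbage(); collectgarbage()
   end
end
```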
__Benchmarking__

GPU operations will typically continue after an instruction has been issued. So, if you do:
```lua
require 'cutorch'
local a = torch.CudaTensor(1000,1000):uniform()
a:add(1)
```
… then the GPU kernel to add 1 will only be scheduled for launch by `a:add(1)`. It might not have completed, or even have reached the GPU, at the time that `a:add(1)` returns. Therefore, for accurate wall-clock timings, you should call `cutorch.synchronize()` before each timecheck point:
```lua
require 'cutorch'
require 'sys'

local a = torch.CudaTensor(1000,1000):uniform()
cutorch.synchronize()
start = sys.tic()
a:add(1)
cutorch.synchronize()
print(sys.toc())
```