Name: torch-autograd
Owner: Twitter, Inc.
Description: Autograd automatically differentiates native Torch code
Created: 2015-10-06 14:51:56.0
Updated: 2018-01-03 11:27:41.0
Pushed: 2017-08-29 17:35:10.0
Size: 2342
Language: Lua
Autograd automatically differentiates native Torch code. Inspired by the original Python version.
Autograd has multiple goals:
Jan 21, 2016: Two big new user-facing features:
- Direct assignment is now supported: you can write x[k] = v inside optimize=true autograd code, where k can be a number, table or LongTensor, and v can be a tensor or number, whichever is appropriate.
- Second-order and higher gradients can now be taken in optimized mode: either run autograd.optimize(true) or take the derivative of your function using df = autograd(f, {optimize = true}). See the tests for a simple example.
- New tensor manipulation utilities: autograd.util.cat can work with numbers, or tensors of any type, and autograd.util.cast can cast a nested table of tensors to any type you like.
Nov 16, 2015: Runtime performance was improved dramatically, as was ease of use thanks to better debugging tools. Performance is now within 30% of a statically described version of an equivalent model (nn and nngraph).
- When debugging is enabled, a nan or inf will trigger a callback that can be used to render a DOT representation of the graph (see debugging).
- User code is now restricted to the functional API of Torch: a:add(b) is forbidden, use res = torch.add(a,b) instead.
- Additional control flags can be passed to d(f, {...}) to compute subparts of the graph (fprop or bprop), useful to generate a compiled fprop (see fine grained control).
Nov 6, 2015: initial release.
A simple neural network with a multinomial logistic loss:
-- libraries:
t = require 'torch'
grad = require 'autograd'
-- define trainable parameters:
params = {
   W = {
      t.randn(100,50),
      t.randn(50,10),
   },
   b = {
      t.randn(50),
      t.randn(10),
   }
}
-- define model
neuralNet = function(params, x, y)
   local h1 = t.tanh(x * params.W[1] + params.b[1])
   local h2 = t.tanh(h1 * params.W[2] + params.b[2])
   local yHat = h2 - t.log(t.sum(t.exp(h2)))
   local loss = - t.sum(t.cmul(yHat, y))
   return loss
end
-- gradients:
dneuralNet = grad(neuralNet)
-- some data:
x = t.randn(1,100)
y = t.Tensor(1,10):zero() y[1][3] = 1
-- compute loss and gradients wrt all parameters in params:
dparams, loss = dneuralNet(params, x, y)
-- in this case:
--> loss: is a scalar (Lua number)
--> dparams: is a table that mimics the structure of params; for
--  each Tensor in params, dparams provides the derivatives of the
--  loss wrt that Tensor.
Important note: only variables packed in the first argument of the eval function will have their gradients computed. In the example above, if the gradients wrt x are needed, then x simply has to be moved into params. The params table can be arbitrarily nested.
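The last two lines of neuralNet compute a log-softmax of h2 followed by a negative log-likelihood against a one-hot target. As a rough illustration of that arithmetic only, here is a plain-Lua sketch on ordinary tables, without Torch. The helper names logSoftmax and nll are hypothetical and are not part of autograd.

```lua
-- Plain-Lua sketch (no Torch): the loss used in neuralNet, on ordinary tables.
-- logSoftmax and nll are hypothetical helper names, not part of autograd.

-- yHat[i] = h[i] - log(sum_j exp(h[j]))
local function logSoftmax(h)
   local sumExp = 0
   for i = 1, #h do
      sumExp = sumExp + math.exp(h[i])
   end
   local logZ = math.log(sumExp)
   local yHat = {}
   for i = 1, #h do
      yHat[i] = h[i] - logZ
   end
   return yHat
end

-- loss = -sum_i yHat[i] * y[i]
local function nll(yHat, y)
   local loss = 0
   for i = 1, #y do
      loss = loss - yHat[i] * y[i]
   end
   return loss
end

local h = {0.1, -0.3, 0.8}   -- example activations (tanh outputs lie in [-1, 1])
local y = {0, 0, 1}          -- one-hot target selecting class 3
local yHat = logSoftmax(h)
local loss = nll(yHat, y)
print(loss)
```

With a one-hot target, the loss reduces to the negated log-probability of the selected class, which is why only one entry of yHat contributes.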
See more complete examples in examples.
Assuming the model defined above, and a training set of {x,y} pairs, the model can easily be optimized using SGD:
for i,sample in datasetIterator() do
   -- estimate gradients wrt params:
   local grads, loss = dneuralNet(params, sample.x, sample.y)
   -- SGD step:
   for i = 1,#params.W do
      -- update params with an arbitrary learning rate:
      params.W[i]:add(-.01, grads.W[i])
      params.b[i]:add(-.01, grads.b[i])
   end
end
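Because the gradient table mirrors the structure of params, the update can also be written once for any nesting depth. The following is a minimal plain-Lua sketch of that idea, with numbers standing in for tensors; sgdStep is a hypothetical helper and the gradient values are made up for illustration.

```lua
-- Plain-Lua sketch (no Torch): one SGD step over arbitrarily nested tables.
-- sgdStep is a hypothetical helper; numbers stand in for tensors.
local function sgdStep(params, grads, lr)
   for k, v in pairs(params) do
      if type(v) == 'table' then
         sgdStep(v, grads[k], lr)        -- recurse into nested tables
      else
         params[k] = v - lr * grads[k]   -- leaf update: p <- p - lr * dp
      end
   end
end

local params = { W = {1.0, 2.0}, b = {0.5} }
local grads  = { W = {10.0, -10.0}, b = {1.0} }  -- made-up gradient values
sgdStep(params, grads, 0.01)
print(params.W[1], params.W[2], params.b[1])
```

With real tensors, the leaf update would be the in-place params[k]:add(-lr, grads[k]) call used in the loop above.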
To enable the optimizer, which produces optimized representations of your loss and gradient functions (as generated lua code):
grad = require 'autograd'
grad.optimize(true) -- global
local df = grad(f, { optimize = true }) -- for this function only
local grads = df(params)
Benefits:
Caveats:
The nn library provides all sorts of highly optimized primitives, with gradient code written and optimized manually. Sometimes it's useful to rely on these for maximum performance.
Here we rewrite the neural net example from above, but this time relying on a mix of nn primitives and autograd-inferred gradients:
-- libraries:
t = require 'torch'
grad = require 'autograd'
-- define trainable parameters:
params = {
   linear1 = {
      t.randn(50,100), -- note that parameters are transposed (nn convention for nn.Linear)
      t.randn(50),
   },
   linear2 = {
      t.randn(10,50),
      t.randn(10),
   }
}
-- instantiate nn primitives:
-- Note: we do this outside of the eval function, so that memory
-- is only allocated once; moving these calls to within the body
-- of neuralNet would work too, but would be considerably slower.
linear1 = grad.nn.Linear(100, 50)
acts1 = grad.nn.Tanh()
linear2 = grad.nn.Linear(50, 10)
acts2 = grad.nn.Tanh()
-- define model
neuralNet = function(params, x, y)
   local h1 = acts1(linear1(params.linear1, x))
   local h2 = acts2(linear2(params.linear2, h1))
   local yHat = h2 - t.log(t.sum(t.exp(h2)))
   local loss = - t.sum(t.cmul(yHat, y))
   return loss
end
-- gradients:
dneuralNet = grad(neuralNet)
-- some data:
x = t.randn(1,100)
y = t.Tensor(1,10):zero() y[1][3] = 1
-- compute loss and gradients wrt all parameters in params:
dparams, loss = dneuralNet(params, x, y)
This code is strictly equivalent to the code above, but will be more efficient (this is especially true for more complex primitives like convolutions, …).
3rd party libraries that provide a similar API to nn can be registered like this:
local customnnfuncs = grad.functionalize('customnn') -- requires 'customnn' and wraps it
module = customnnfuncs.MyNnxModule(...)
-- under the hood, this is already done for nn:
grad.nn = grad.functionalize('nn')
On top of this functional API, existing nn modules and containers, with arbitrarily nested parameters, can also be wrapped into functions. This is particularly handy when doing transfer learning from existing models:
-- Define a standard nn model:
local model = nn.Sequential()
model:add(nn.SpatialConvolutionMM(3, 16, 3, 3, 1, 1, 1, 1))
model:add(nn.Tanh())
model:add(nn.Reshape(16*8*8))
model:add(nn.Linear(16*8*8, 10))
model:add(nn.Tanh())
-- Note that this model could have been pre-trained, and reloaded from disk.
-- Functionalize the model:
local modelf, params = autograd.functionalize(model)
-- The model can now be used as part of a regular autograd function:
local loss = autograd.nn.MSECriterion()
neuralNet = function(params, x, y)
   local h = modelf(params, x)
   return loss(h, y)
end
-- Note: the parameters are always handled as an array, passed as the first
-- argument to the model function (modelf). This API is similar to the other
-- model primitives we provide (see below in "Model Primitives").
-- Note 2: if there are no parameters in the model, then you need to pass the input only, e.g.:
local model = nn.Sigmoid()
-- Functionalize:
local sigmoid = autograd.functionalize(model)
-- The sigmoid can now be used as part of a regular autograd function:
local loss = autograd.nn.MSECriterion()
neuralNet = function(params, x, y)
   local h = sigmoid(x) -- please note the absence of params arg
   return loss(h, y)
end
For those who have a training pipeline that heavily relies on the torch/nn API, torch-autograd defines the autograd.nn.AutoModule and autograd.nn.AutoCriterion functions. When given a name, each creates a new class locally under autograd.auto.name. This class can be instantiated by providing a function, a weight, and a bias. The resulting objects are also clonable, savable and loadable.
Here we show an example of writing a 2-layer fully-connected module and an MSE criterion using AutoModule and AutoCriterion:
-- Define functions for modules
-- Linear
local linear = function(input, weight, bias)
   local y = weight * input + bias
   return y
end
-- Linear + ReLU
local linearReLU = function(input, weight, bias)
   local y = weight * input + bias
   local output = torch.mul( torch.abs( y ) + y, 0.5)
   return output
end
-- Define function for criterion
-- MSE
local mse = function(input, target)
   local buffer = input-target
   return torch.sum( torch.cmul(buffer, buffer) ) / (input:dim() == 2 and input:size(1)*input:size(2) or input:size(1))
end
-- Input size, number of hidden units
local inputSize, outputSize = 100, 1000
-- Define auto-modules and auto-criteria
-- and instantiate them immediately
-- (linear1 and linear2 are not defined in this snippet; they are assumed
-- to be existing nn.Linear modules providing initial weights and biases)
local autoModel = nn.Sequential()
local autoLinear1ReLU = autograd.nn.AutoModule('AutoLinearReLU')(linearReLU, linear1.weight:clone(), linear1.bias:clone())
local autoLinear2 = autograd.nn.AutoModule('AutoLinear')(linear, linear2.weight:clone(), linear2.bias:clone())
autoModel:add( autoLinear1ReLU )
autoModel:add( autoLinear2 )
local autoMseCriterion = autograd.nn.AutoCriterion('AutoMSE')(mse)
-- At this point, print(autograd.auto) should yield
--   AutoLinearReLU : {...}
--   AutoMSE : {...}
--   AutoLinear : {...}
-- Define number of iterations and learning rate
local n = 100000
local lr = 0.001
local autoParams,autoGradParams = autoModel:parameters()
local uniformMultiplier = torch.Tensor(inputSize):uniform()
-- Train: this should learn how to approximate e^(\alpha * x)
-- with an mlp with both auto-modules and regular nn
for i=1,n do
   autoModel:zeroGradParameters()
   local input = torch.Tensor(inputSize):uniform(-5,5):cmul(uniformMultiplier)
   local target = input:clone():exp()
   -- Forward
   local output = autoModel:forward(input)
   local mseOut = autoMseCriterion:forward(output, target)
   -- Backward
   local gradOutput = autoMseCriterion:backward(output, target)
   local gradInput = autoModel:backward(input, gradOutput)
   for i=1,#autoParams do
      autoParams[i]:add(-lr, autoGradParams[i])
   end
end
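The linearReLU function above expresses the rectifier as (|y| + y) / 2 rather than as a max, presumably so that it only uses functional Torch operations. A quick plain-Lua check of that identity on scalars (reluViaAbs is a hypothetical name used only here):

```lua
-- Plain-Lua check of the identity used in linearReLU: (|y| + y) / 2 == max(y, 0).
-- reluViaAbs is a hypothetical helper operating on plain numbers.
local function reluViaAbs(y)
   return (math.abs(y) + y) * 0.5
end

local samples = {-2.5, -1, 0, 0.5, 3}
for _, y in ipairs(samples) do
   -- negative inputs map to 0, non-negative inputs are returned unchanged
   assert(reluViaAbs(y) == math.max(y, 0))
end
```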
For peace of mind (and to write proper tests), a simple grad checker is provided. See test.lua for complete examples. In short, it can be used like this:
-- Parameters:
local W = t.Tensor(32,100):normal()
local x = t.Tensor(100):normal()
-- Function:
local func = function(inputs)
   return t.sum(inputs.W * inputs.x)
end
-- Check grads wrt all inputs:
tester:assert(gradcheck(func, {W=W, x=x}), 'incorrect gradients on W and x')
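For intuition about what a gradient checker does, the sketch below compares an analytic gradient against a central finite-difference estimate in plain Lua. It is an illustration under stated assumptions, not the gradcheck shipped with autograd: numericGrad and checkGrad are hypothetical helpers, and the step size and tolerance are arbitrary choices.

```lua
-- Plain-Lua sketch (no Torch) of a finite-difference gradient check.
-- numericGrad and checkGrad are hypothetical helpers, not autograd's gradcheck.

-- Central-difference estimate of df/dx[i] for a function of a flat table x.
local function numericGrad(f, x, eps)
   local g = {}
   for i = 1, #x do
      local orig = x[i]
      x[i] = orig + eps
      local fPlus = f(x)
      x[i] = orig - eps
      local fMinus = f(x)
      x[i] = orig            -- restore the input
      g[i] = (fPlus - fMinus) / (2 * eps)
   end
   return g
end

-- Returns true when every analytic entry is within tol of the numeric estimate.
local function checkGrad(f, analytic, x, eps, tol)
   local numeric = numericGrad(f, x, eps)
   for i = 1, #x do
      if math.abs(numeric[i] - analytic[i]) > tol then
         return false
      end
   end
   return true
end

-- f(x) = sum_i w[i] * x[i]^2, whose gradient is 2 * w[i] * x[i]
local w = {0.5, -1.0, 2.0}
local f = function(x)
   local s = 0
   for i = 1, #x do s = s + w[i] * x[i] * x[i] end
   return s
end
local x = {1.0, 2.0, -3.0}
local goodGrad = {2 * w[1] * x[1], 2 * w[2] * x[2], 2 * w[3] * x[3]}
local badGrad  = {0, 0, 0}   -- deliberately wrong
local okGood = checkGrad(f, goodGrad, x, 1e-4, 1e-6)
local okBad  = checkGrad(f, badGrad, x, 1e-4, 1e-6)
print(okGood, okBad)
```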
To ease the construction of new models, we provide primitives to generate standard models.
Each constructor returns 2 things:
- f: the function, can be passed to grad(f) to get gradients
- params: the list of trainable parameters
Once instantiated, f and params can be used like this:
input = torch.randn(10)
pred = f(params, input)
grads = autograd(f)(params, input)
Current list of model primitives includes:
API:
f,params = autograd.model.NeuralNetwork({
   -- number of input features:
   inputFeatures = 10,
   -- number of hidden features, per layer, in this case
   -- 2 layers, each with 100 and 10 features respectively:
   hiddenFeatures = {100,10},
   -- activation functions:
   activations = 'ReLU',
   -- if true, then no activation is used on the last layer;
   -- this is useful to feed a loss function (logistic, ...)
   classifier = false,
   -- dropouts:
   dropoutProbs = {.5, .5},
})
API:
f,params = autograd.model.SpatialNetwork({
   -- number of input features (maps):
   inputFeatures = 3,
   -- number of hidden features, per layer:
   hiddenFeatures = {16, 32},
   -- poolings, for each layer:
   poolings = {2, 2},
   -- activation functions:
   activations = 'Sigmoid',
   -- kernel size:
   kernelSize = 3,
   -- dropouts:
   dropoutProbs = {.1, .1},
})
API:
f,params = autograd.model.RecurrentNetwork({
   -- number of input features (maps):
   inputFeatures = 100,
   -- number of output features:
   hiddenFeatures = 200,
   -- output is either the last h at step t,
   -- or the concatenation of all h states at all steps
   outputType = 'last', -- or 'all'
})
API:
f,params = autograd.model.RecurrentLSTMNetwork({
   -- number of input features (maps):
   inputFeatures = 100,
   -- number of output features:
   hiddenFeatures = 200,
   -- output is either the last h at step t,
   -- or the concatenation of all h states at all steps
   outputType = 'last', -- or 'all'
})
Similarly to model primitives, we provide common loss functions in autograd.loss:
-- cross entropy between 2 vectors:
-- (for categorical problems, the target should be encoded as one-hot)
loss = autograd.loss.crossEntropy(prediction, target)
-- binary cross entropy - same as above, but labels are considered independent bernoulli variables:
loss = autograd.loss.binaryEntropy(prediction, target)
-- least squares - mean square error between 2 vectors:
loss = autograd.loss.leastSquares(prediction, target)
autograd can be called from within an autograd function, and the resulting gradients can be used as part of your outer function:
local d = require 'autograd'
d.optimize(true)
local innerFn = function(params)
   -- compute something...
end
local ddf = d(function(params)
   local grads = d(innerFn)(params)
   -- do something with grads of innerFn...
end)
local gradGrads = ddf(params) -- second order gradient of innerFn
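The nesting pattern itself (a derivative operator applied to a function that already contains a derivative) can be tried without Torch. The sketch below uses central finite differences as a stand-in for autograd's d(f); numDiff is a hypothetical helper and the step size is an arbitrary choice, so this illustrates the composition only, not how autograd computes gradients.

```lua
-- Plain-Lua sketch (no Torch): nesting a derivative operator.
-- numDiff is a hypothetical finite-difference stand-in for autograd's d(f).
local function numDiff(f, h)
   h = h or 1e-4
   return function(x)
      return (f(x + h) - f(x - h)) / (2 * h)
   end
end

local f = function(x) return x * x * x end   -- f(x) = x^3
local df = numDiff(f)                        -- approximates 3 * x^2
local ddf = numDiff(numDiff(f))              -- approximates 6 * x
print(df(2), ddf(2))
```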
Debugging hooks can be inserted when wrapping the function with autograd.
The debugger will turn off any optimizations and insert NaN/Inf checks
after every computation. If any of these trip the debugHook will be called
with a message providing as much information as possible about the
offending function, call stack and values. The debugHook also provides
an interface to save or render a GraphViz dot file of the computation
graph. We don't recommend leaving the debugHook installed all the time
as your training speed will be significantly slower.
grad(f, {
   debugHook = function(debugger, msg, gen)
      -- dump a dot representation of the graph:
      debugger.generateDot('result.dot')
      -- or show it (OSX only, uses Safari):
      debugger.showDot()
      -- print the generated source line that caused the inf/nan
      print(string.split(gen.source, "\n")[gen.line])
   end
})
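To make the idea of a NaN/Inf check after each computation concrete, here is a small plain-Lua sketch on scalars. It is not autograd's debugger: isNanOrInf and checked are hypothetical helpers, and the hook simply records a message.

```lua
-- Plain-Lua sketch (no Torch) of a NaN/Inf test on a single number.
-- isNanOrInf and checked are hypothetical helpers, not autograd's debugger.
local function isNanOrInf(v)
   -- NaN is the only value that is not equal to itself
   return v ~= v or v == math.huge or v == -math.huge
end

-- Wrap a function so that a hook fires when its result is NaN or Inf.
local function checked(name, fn, hook)
   return function(...)
      local result = fn(...)
      if isNanOrInf(result) then
         hook(name .. ' produced ' .. tostring(result))
      end
      return result
   end
end

local messages = {}
local hook = function(msg) messages[#messages + 1] = msg end
local div = checked('div', function(a, b) return a / b end, hook)

div(1, 2)   -- finite result: hook is not called
div(1, 0)   -- inf: hook is called
div(0, 0)   -- nan: hook is called
print(#messages)
```

Checking every intermediate result this way adds work to each operation, which is consistent with the advice above not to leave the debugHook installed all the time.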
Consider this usage of autograd, which clearly contains a divide by zero.
local W = torch.Tensor(32,100):fill(.5)
local x = torch.Tensor(100):fill(.5)
local func = function(inputs)
   return torch.sum(torch.div(inputs.W * inputs.x, 0)) -- DIV ZERO!
end
local dFunc = autograd(func, {
   debugHook = function(debugger, msg)
      debugger.showDot()
      print(msg)
      os.exit(0)
   end
})
dFunc({W=W, x=x})
Will output:
autograd debugger detected a nan or inf value for locals[1]
: fn@path/to/code/example.lua:4
And render in Safari as:
Finer-grain control over execution can also be achieved using these flags:
-- All of these options default to true:
grad(f, {
   withForward = true | false,    -- compute the forward path
   withGradients = true | false,  -- compute the gradients (after forward)
   partialGrad = true | false     -- partial grad means that d(f) expects grads wrt output
})
-- Running this:
pred = grad(f, {withForward=true, withGradients=false})(inputs)
-- is equivalent to:
pred = f(inputs)
-- ... but the function is compiled, and benefits from tensor re-use!
Licensed under the Apache License, Version 2.0. See LICENSE file.