libdl
0.0.1
Simple yet powerful deep learning
Optimization in the context of deep learning usually means finding a local minimum of the loss function via some optimization routine, which in turn needs to automatically compute derivatives of the model's output function (e.g. gradient descent needs the first-order derivative, and Newton's method even the second-order one). Thus, support for automatic differentiation (autodiff), i.e., evaluating the derivative at a desired point without manually calculating it, is central to any deep learning library that actually wants to support the learning part.
In a twist of sheer elegance, consider the following observation. For \(a, b\in\mathbb R\), we call \(z=a+b\varepsilon\) a dual number, where \(\varepsilon \neq 0\) is a new symbol defined such that \(\varepsilon^2 = 0\). We will now use dual numbers to evaluate both the function and its derivative in one go. First, observe that for arbitrary dual numbers \(u+\dot u\varepsilon, v+\dot v\varepsilon\) it holds that
\begin{align} (u+\dot u\varepsilon) + (v+\dot v\varepsilon) &= (u+v) + (\dot u + \dot v)\varepsilon, \text{and}\\ (u+\dot u\varepsilon) \cdot (v+\dot v\varepsilon) &= uv + (u\dot v + \dot uv)\varepsilon. \end{align}
Notice how this expresses the linearity of differentiation and the product rule if we interpret \(\dot u\) and \(\dot v\) as the derivatives of \(u\) and \(v\), respectively. Now let \(f\colon \mathbb{R}\to\mathbb{R}\) be a differentiable function. We extend it to the dual numbers via
\[f(x + \dot x \varepsilon) := f(x) + f'(x)\dot x\varepsilon.\]
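For instance, for the sine this definition yields
\[\sin(x + \dot x\varepsilon) = \sin(x) + \cos(x)\dot x\varepsilon.\]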
Note that this definition makes sense with our previous interpretation. Further, we can now observe that the chain rule also holds:
\begin{align} f(g(x + \dot x\varepsilon)) &= f(g(x) + g'(x)\dot x\varepsilon)\\ &= f(g(x)) + f'(g(x))g'(x)\dot x\varepsilon. \end{align}
Given that this actually holds in general (which it does), we can now simultaneously evaluate \(f(x)\) and \(f'(x)\) by calculating \(f(x+1\varepsilon)\): the real part is \(f(x)\), and the coefficient of \(\varepsilon\) is \(f'(x)\). For example, for \(f(x)=x^2\) at \(x=3\) we get \((3+\varepsilon)^2 = 9 + 6\varepsilon\), recovering \(f(3)=9\) and \(f'(3)=6\).
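To make this concrete, here is a minimal forward-mode sketch in C++. The toy Dual type and the functions below are purely illustrative and entirely separate from libdl's tensor API:

```cpp
#include <cmath>
#include <iostream>

// A dual number: the value and the coefficient of epsilon.
struct Dual {
    double val;  // f(x)
    double dot;  // f'(x), the coefficient of epsilon
};

// Linearity of differentiation: (u + u'e) + (v + v'e) = (u+v) + (u'+v')e.
Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.dot + b.dot}; }

// Product rule: (u + u'e)(v + v'e) = uv + (uv' + u'v)e.
Dual operator*(Dual a, Dual b) {
    return {a.val * b.val, a.val * b.dot + a.dot * b.val};
}

// Extension of a primitive: f(x + x'e) = f(x) + f'(x)x'e.
Dual log(Dual a) { return {std::log(a.val), a.dot / a.val}; }

int main() {
    // Evaluate f(x) = x * log(x) and f'(x) at x = 2 in a single pass
    // by seeding the derivative with 1, i.e. evaluating f(2 + 1e).
    Dual x{2.0, 1.0};
    Dual y = x * log(x);
    std::cout << "f(2)  = " << y.val << "\n";  // 2 ln 2   ~ 1.386
    std::cout << "f'(2) = " << y.dot << "\n";  // ln 2 + 1 ~ 1.693
}
```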
But what about functions \(f\colon \mathbb{R}^n \to \mathbb{R}\) with more than one input? Since a forward pass can only calculate the derivative in one direction per pass, we need \(n\) passes to compute the gradient \(\nabla f\), which is quite inefficient. This is where reverse-mode automatic differentiation shines.
Consider the composite function \(h := f\circ g\). The chain rule states that the derivative of \(h\) at a point \(x\) can be calculated via
\[\left.\frac{\partial h}{\partial x}\right|_{x} = \left.\frac{\partial f}{\partial u}\right|_{u=g(x)} \cdot \left.\frac{\partial u}{\partial x}\right|_{x}.\]
This is crucial since it allows us to compute the gradient at a single point without first calculating the first-order derivative over all points (as would be done with symbolic differentiation). Instead, we overload the semantics of each function to calculate its value (as it usually would) and additionally record the computation, so that the gradients can later be computed in the reverse order of the function applications. That is, in the example above, the forward pass would first compute \(y = g(x)\), then \(z = f(y)\), and the backward pass would then compute the gradients \(\frac{\partial z}{\partial y}\) and then \(\frac{\partial y}{\partial x}\).
This way, we can compute all partial derivatives in a single backward pass (and are thus more efficient than forward mode for functions with many inputs) but need to store the intermediate results and the computation order for the backward pass (and thus require more memory than forward mode).
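As a toy illustration of this bookkeeping, the following standalone C++ sketch records each operation on a tape and replays it in reverse; it is not libdl's actual implementation, and all names (Var, tape, mul, log_) are made up here, though the backward callback plays the same role as dl::TensorImpl::gradfn:

```cpp
#include <cmath>
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct Var {
    double value = 0.0;
    double grad = 0.0;
    // Propagates this node's gradient to its inputs (no-op for leaves).
    std::function<void()> backward = [] {};
};
using VarPtr = std::shared_ptr<Var>;

std::vector<VarPtr> tape;  // records the computation order

VarPtr mul(VarPtr a, VarPtr b) {
    auto out = std::make_shared<Var>();
    out->value = a->value * b->value;
    Var* o = out.get();  // raw pointer avoids a shared_ptr cycle
    out->backward = [a, b, o] {  // product rule, applied in reverse
        a->grad += b->value * o->grad;
        b->grad += a->value * o->grad;
    };
    tape.push_back(out);
    return out;
}

VarPtr log_(VarPtr a) {
    auto out = std::make_shared<Var>();
    out->value = std::log(a->value);
    Var* o = out.get();
    out->backward = [a, o] { a->grad += o->grad / a->value; };
    tape.push_back(out);
    return out;
}

int main() {
    auto x = std::make_shared<Var>(); x->value = 2.0;
    auto y = std::make_shared<Var>(); y->value = 3.0;
    auto z = mul(x, log_(y));  // z = x * log(y)

    // Backward pass: seed dz/dz = 1, then replay the tape in reverse.
    z->grad = 1.0;
    for (auto it = tape.rbegin(); it != tape.rend(); ++it) (*it)->backward();

    std::cout << "dz/dx = " << x->grad << "\n";  // log(3) ~ 1.099
    std::cout << "dz/dy = " << y->grad << "\n";  // x / y  ~ 0.667
}
```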
This example will equip the logarithm, dl::log(dl::TensorPtr x), with support for automatic differentiation. We will start off with a basic stub.
Remember from the description above that, to enable a function for the backward pass, it only needs to register a callback for calculating the derivative in dl::TensorImpl::gradfn.
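Using that the derivative of the logarithm is \(\frac{d}{dx}\log x = \frac{1}{x}\), the registration could look roughly like this; dl::TensorImpl::gradfn is the only name taken from the docs, and everything else is an assumption for illustration (in particular that TensorPtr dereferences to TensorImpl):

```cpp
// Hypothetical sketch of the gradient-enabled version.
namespace dl {
    TensorPtr log(TensorPtr x) {
        TensorPtr result = /* elementwise std::log over x's values */;
        // Register the backward callback: the incoming gradient is
        // divided elementwise by x, since d/dx log(x) = 1/x.
        result->gradfn = [x](TensorPtr grad) {
            return /* grad divided elementwise by x */;
        };
        return result;
    }
}
```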
The same recipe extends to matrix-valued operations; for a matrix product \(AB\), for example, the backward pass uses
\[\frac{\partial AB}{\partial A} = B^\top,\]
i.e. the incoming gradient is multiplied by \(B^\top\) on the right to obtain the gradient with respect to \(A\).