Backpropagation: Step-By-Step Derivation | by Dr. Roi Yehoshua | Apr, 2023

A full guide to the algorithm used to train neural networks
In the previous article we discussed multi-layer perceptrons (MLPs) as the first neural network model that could solve non-linear and complex problems.
For a long time it was not clear how to train these networks on a given data set. While single-layer perceptrons had a simple learning rule that was guaranteed to converge to a solution, it could not be extended to networks with more than one layer. The AI community struggled with this problem for more than 30 years (in a period known as the "AI winter"), until in 1986 Rumelhart et al. introduced the backpropagation algorithm in their groundbreaking paper [1].
In this article we will discuss the backpropagation algorithm in detail and derive its mathematical formulation step by step. Since this is the main algorithm used to train neural networks of all kinds (including the deep networks we have today), I believe it would be useful for anyone working with neural networks to know the details of this algorithm.
Although you can find descriptions of this algorithm in many textbooks and online sources, in writing this article I have tried to keep the following principles in mind:
- Use clear and consistent notation
- Explain every step of the mathematical derivation
- Derive the algorithm for the most general case, i.e., for networks with any number of layers and any activation or loss functions
After deriving the backpropagation equations, complete pseudocode for the algorithm is given and then illustrated with a numerical example.
Before reading the article, I recommend that you refresh your calculus knowledge, especially derivatives (including partial derivatives and the chain rule).
Now grab a cup of coffee and let's dive in 🙂
The backpropagation algorithm consists of three phases:
- Forward pass. In this phase we feed the inputs through the network, make a prediction, and measure its error with respect to the true label.
- Backward pass. We propagate the gradients of the error with respect to each of the weights backward from the output layer to the input layer.
- Gradient descent step. We slightly tweak the connection weights in the network by taking a step in the opposite direction of the error gradients.
We will now go over each of these phases in more detail.
In the forward pass, we propagate the inputs in a forward direction, layer by layer, until the output is generated. The activation of neuron i in layer l is computed using the following equation:
aᵢˡ = f(zᵢˡ) = f(Σⱼ wᵢⱼˡ aⱼˡ⁻¹ + bᵢˡ)
where f is the activation function, zᵢˡ is the net input of neuron i in layer l, wᵢⱼˡ is the connection weight between neuron j in layer l − 1 and neuron i in layer l, and bᵢˡ is the bias of neuron i in layer l. For more details on the notation and the derivation of this equation, see my previous article.
To simplify the derivation of the learning algorithm, we will treat the bias as if it were the weight w₀ of an input neuron x₀ that has a constant value of 1. This allows us to write the above equation as follows:
aᵢˡ = f(zᵢˡ) = f(Σⱼ₌₀ wᵢⱼˡ aⱼˡ⁻¹),  where a₀ˡ⁻¹ = 1 and wᵢ₀ˡ = bᵢˡ
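To make the notation concrete, here is a minimal NumPy sketch of the forward pass (my own illustration, not code from the article), assuming fully connected layers with sigmoid activations and the bias folded in as the weight of a constant input x₀ = 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Propagate the input x through the network, layer by layer.

    weights[l][i, j] is the connection weight between neuron j in one layer
    and neuron i in the next; column 0 holds the bias weights, which
    multiply a constant input of 1.
    """
    a = np.asarray(x, dtype=float)
    for W in weights:
        z = W @ np.append(1.0, a)  # net input: z_i = sum_j w_ij * a_j (j = 0 is the bias)
        a = sigmoid(z)             # activation: a_i = f(z_i)
    return a
```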
In the backward pass we propagate the gradients of the error from the output layer back to the input layer.
Definition of the Error and Loss Functions
We first define the error of the network on the training set with respect to its weights. Let's denote by w the vector that contains all the weights of the network.
Assume that we have n training samples {(xᵢ, yᵢ)}, i = 1,…,n, and that the output of the network on sample i is oᵢ. Then the error of the network with respect to w is:
E(w) = Σᵢ₌₁ⁿ J(yᵢ, oᵢ)
where J(y, o) is the loss function. The specific loss function we use depends on the task the network is trying to accomplish:
1. For regression problems, we use the squared loss function:
J(y, o) = ½(y − o)²
2. For binary classification problems, we use log loss (also known as the binary cross-entropy loss):
J(y, o) = −[y log o + (1 − y) log(1 − o)]
3. For multi-class classification problems, we use the cross-entropy loss function:
J(y, o) = −Σᵢ₌₁ᵏ yᵢ log oᵢ
where k is the number of classes.
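As a quick illustration (a sketch of my own; the function names are not from the article), here are the three loss functions in NumPy:

```python
import numpy as np

def squared_loss(y, o):
    """Squared loss for regression: J(y, o) = 0.5 * (y - o)^2."""
    return 0.5 * (y - o) ** 2

def log_loss(y, o):
    """Binary cross-entropy: J(y, o) = -[y*log(o) + (1 - y)*log(1 - o)]."""
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

def cross_entropy_loss(y, o):
    """Cross-entropy for k classes; y is a one-hot vector, o the softmax output."""
    return -np.sum(y * np.log(o))
```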
The reasons why we use these particular loss functions will be explained in a future article.
Our goal is to find the weights w that minimize E(w). Unfortunately, this function is non-convex because of the non-linear activations of the hidden neurons, which means it may have multiple local minima:
There are various techniques that can be used to prevent gradient descent from getting stuck in a local minimum, such as momentum. These techniques will be covered in future articles.
Finding the Gradients of the Error
In order to use gradient descent, we need to compute the partial derivatives of E(w) with respect to each of the weights in the network:
∂E/∂wᵢⱼˡ
To simplify the mathematical derivation, we will assume that we have only one training example and find the partial derivatives of the error with respect to that example:
E(w) = J(y, o)
where y is the label of this example and o is the output of the network for it. The extension to n training samples is straightforward, since the derivative of a sum of functions is just the sum of their derivatives.
The computation of the partial derivatives of the weights in the hidden layers is not trivial, since these weights do not affect the output (and hence the error) directly. To tackle this problem, we will use the chain rule to establish a relationship between the gradients of the error in a given layer and the gradients in the subsequent layer.
The Delta Terms
We first note that E depends on the weight wᵢⱼˡ only via the net input zᵢˡ of neuron i in layer l. Therefore, we can apply the chain rule of derivatives to the gradient of E with respect to this weight:
∂E/∂wᵢⱼˡ = (∂E/∂zᵢˡ)(∂zᵢˡ/∂wᵢⱼˡ)
The second derivative on the right side of the equation is:
∂zᵢˡ/∂wᵢⱼˡ = ∂(Σₖ wᵢₖˡ aₖˡ⁻¹)/∂wᵢⱼˡ = aⱼˡ⁻¹
Therefore, we can write:
∂E/∂wᵢⱼˡ = δᵢˡ aⱼˡ⁻¹,  where δᵢˡ = ∂E/∂zᵢˡ
The variable δᵢˡ is called the delta term of neuron i, or delta for short.
The Delta Rule
The delta rule establishes the relationship between the delta terms in layer l and the delta terms in layer l + 1.
To derive the delta rule, we again use the chain rule of derivatives. The loss function depends on the net input of neuron i only via the net inputs of all the neurons it is connected to in layer l + 1. Therefore we can write:
δᵢˡ = ∂E/∂zᵢˡ = Σⱼ [(∂E/∂zⱼˡ⁺¹)(∂zⱼˡ⁺¹/∂zᵢˡ)]
where the index j in the sum runs over all the neurons in layer l + 1 that neuron i in layer l is connected to.
Once again we use the chain rule to decompose the second partial derivative inside the brackets:
δᵢˡ = Σⱼ [(∂E/∂zⱼˡ⁺¹)(∂zⱼˡ⁺¹/∂aᵢˡ)(∂aᵢˡ/∂zᵢˡ)]
The first partial derivative inside the brackets is just the delta of neuron j in layer l + 1, therefore we can write:
δᵢˡ = Σⱼ [δⱼˡ⁺¹ (∂zⱼˡ⁺¹/∂aᵢˡ)(∂aᵢˡ/∂zᵢˡ)]
The second partial derivative is easy to compute:
∂zⱼˡ⁺¹/∂aᵢˡ = ∂(Σₖ wⱼₖˡ⁺¹ aₖˡ)/∂aᵢˡ = wⱼᵢˡ⁺¹
Therefore we get:
δᵢˡ = (∂aᵢˡ/∂zᵢˡ) Σⱼ δⱼˡ⁺¹ wⱼᵢˡ⁺¹
But aᵢˡ = f(zᵢˡ), where f is the activation function. Hence, the partial derivative outside the sum is just the derivative of the activation function, f′(x), evaluated at x = zᵢˡ.
Therefore we can write:
δᵢˡ = f′(zᵢˡ) Σⱼ δⱼˡ⁺¹ wⱼᵢˡ⁺¹
This equation, known as the delta rule, shows the relationship between the deltas in layer l and the deltas in layer l + 1. More specifically, each delta in layer l is a linear combination of the deltas in layer l + 1, where the coefficients of the combination are the connection weights between these layers. The delta rule allows us to compute all the delta terms (and thus all the gradients of the error) recursively, starting from the deltas in the output layer and going back layer by layer until we reach the input layer.
The following diagram illustrates the flow of the error information:
For specific activation functions, we can derive more explicit forms of the delta rule. For example, if we use the sigmoid function, then:
f(x) = σ(x) = 1 / (1 + e⁻ˣ)
The derivative of the sigmoid function has a simple form:
σ′(x) = σ(x)(1 − σ(x))
Hence:
f′(zᵢˡ) = σ(zᵢˡ)(1 − σ(zᵢˡ)) = aᵢˡ(1 − aᵢˡ)
Then the delta rule for the sigmoid function takes the following form:
δᵢˡ = aᵢˡ(1 − aᵢˡ) Σⱼ δⱼˡ⁺¹ wⱼᵢˡ⁺¹
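Here is a minimal NumPy sketch of this sigmoid delta rule (my own illustration; the layout with the bias weights in column 0 follows the earlier forward-pass sketch):

```python
import numpy as np

def hidden_deltas(a, W_next, delta_next):
    """Delta rule for a sigmoid layer l.

    a          -- activations of layer l (without the constant bias input)
    W_next     -- weight matrix of layer l + 1; column 0 holds the bias weights
    delta_next -- delta terms of layer l + 1

    Returns delta_i = a_i * (1 - a_i) * sum_j W_next[j, i] * delta_next[j].
    """
    # Column 0 multiplies the constant bias input, which receives no error,
    # so only columns 1: correspond to the neurons of layer l.
    return a * (1 - a) * (W_next[:, 1:].T @ delta_next)
```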
The Deltas in the Output Layer
The final piece of the puzzle is the delta terms in the output layer, which are the first ones we need to compute.
The deltas in the output layer depend on both the loss function and the activation function used in the output neurons:
δ = ∂J/∂z = (∂J/∂o) f′(z),  where o = f(z)
where f is the activation function used to compute the output.
1. In regression problems, the activation function we use in the output is the identity function f(x) = x, whose derivative is 1, and the loss function is the squared loss. Therefore the delta is:
δ = (o − y) · 1 = o − y
2. In binary classification problems, the activation function we use is sigmoid and the loss function is log loss, therefore we get:
δ = (∂J/∂o) σ′(z) = ((o − y) / (o(1 − o))) · o(1 − o) = o − y
In other words, the delta is simply the difference between the network's output and the label.
3. In multiclass classification problems, we have k output neurons (where k is the number of classes) and we use softmax activation with the cross-entropy loss. Similar to the previous case, the delta term of the i-th output neuron is surprisingly simple:
δᵢ = oᵢ − yᵢ
where oᵢ is the i-th component of the network's prediction and yᵢ is the i-th component of the label. The proof is somewhat longer, and you can find it in this well-written Medium article.
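As a quick sanity check (my own, not from the article), we can verify numerically that for a sigmoid output with log loss the delta ∂J/∂z equals o − y, using a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, o):
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

z, y, eps = 0.3, 1.0, 1e-6
o = sigmoid(z)
# finite-difference estimate of dJ/dz versus the closed form o - y
numeric = (log_loss(y, sigmoid(z + eps)) - log_loss(y, sigmoid(z - eps))) / (2 * eps)
print(numeric, o - y)  # both approximately -0.4256
```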
Once we finish computing all the delta terms, we can use gradient descent to update the weights. In gradient descent, we take small steps in the opposite direction of the gradient in order to get closer to the minimum error:
w ← w − α∇E(w)
Remember that the partial derivative of the error function with respect to each weight is:
∂E/∂wᵢⱼˡ = δᵢˡ aⱼˡ⁻¹
Therefore, we can write the gradient descent update rule as follows:
wᵢⱼˡ ← wᵢⱼˡ − α δᵢˡ aⱼˡ⁻¹
where α is a learning rate that controls the step size (0 < α < 1). In other words, we subtract from the weight between neuron j in layer l − 1 and neuron i in layer l the delta of neuron i multiplied by the activation of neuron j (scaled by the learning rate).
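In NumPy this update is a one-liner (my own sketch; δᵢˡ aⱼˡ⁻¹ for all i, j is just the outer product of the layer's deltas with the previous layer's activations):

```python
import numpy as np

def gradient_step(W, delta, a_prev, alpha=0.5):
    """One gradient descent step for a layer: w_ij <- w_ij - alpha * delta_i * a_j.

    a_prev are the activations of the previous layer; the bias input 1 is
    prepended so that the bias weights are updated by the same rule.
    """
    return W - alpha * np.outer(delta, np.append(1.0, a_prev))
```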
Gradient descent can be applied in one of the following modes:
- Batch gradient descent: the weights are updated after we compute the error on the entire training set.
- Stochastic gradient descent (SGD): a gradient descent step is performed after every training example. SGD typically converges faster than batch gradient descent but is less stable.
- Mini-batch gradient descent: a middle ground between batch gradient descent and SGD, in which we use small batches of random training samples (typically between 10 and 1,000 examples) for the gradient updates. This reduces the noise of SGD while still being more efficient than full-batch updates, and it is the most common way to train neural networks (a skeleton of this mode is sketched after this list).
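Below is a skeleton of the mini-batch mode (my own sketch; `grad_fn` is a hypothetical stand-in for a function that returns the averaged backpropagation gradients of a batch, one per weight matrix):

```python
import numpy as np

def minibatch_sgd(X, Y, weights, grad_fn, alpha=0.5, batch_size=32, epochs=10):
    """Mini-batch gradient descent skeleton; X and Y are NumPy arrays."""
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the training set each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grads = grad_fn(X[batch], Y[batch], weights)
            # take one step against the averaged gradients of this batch
            weights = [W - alpha * g for W, g in zip(weights, grads)]
    return weights
```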
We are now ready to present the algorithm in its full glory:
As an exercise, try to implement this algorithm in Python (or your favorite programming language); one possible sketch is given below.
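Since the article's pseudocode appears in a figure that is not reproduced here, the following is one possible Python sketch of the full algorithm for a single training example, under the same assumptions as before (sigmoid activations everywhere, log loss, bias weights in column 0 of each weight matrix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights):
    """Forward pass: returns the activations of every layer, input included."""
    activations = [np.asarray(x, dtype=float)]
    for W in weights:
        a = np.append(1.0, activations[-1])  # bias input x0 = 1
        activations.append(sigmoid(W @ a))
    return activations

def backward_pass(y, activations, weights):
    """Backward pass: returns the delta terms of every layer except the input."""
    deltas = [activations[-1] - y]  # output delta for sigmoid + log loss: o - y
    for l in range(len(weights) - 1, 0, -1):
        a = activations[l]
        # delta rule: delta_i = a_i (1 - a_i) * sum_j W[j, i] * delta_j,
        # skipping column 0 of W, which holds the bias weights
        deltas.append(a * (1 - a) * (weights[l][:, 1:].T @ deltas[-1]))
    deltas.reverse()
    return deltas

def backprop_step(x, y, weights, alpha=0.5):
    """One iteration of backpropagation on a single training example."""
    activations = forward_pass(x, weights)
    deltas = backward_pass(y, activations, weights)
    # gradient descent step: w_ij <- w_ij - alpha * delta_i * a_j
    return [W - alpha * np.outer(d, np.append(1.0, a))
            for W, d, a in zip(weights, deltas, activations)]
```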
Imagine that we have a binary classification problem with two binary inputs and a single binary output. Our neural network has two hidden layers with the following weights:
The activation function in both the hidden layers and the output unit is the sigmoid function, and the learning rate is α = 0.5.
The network is presented with a training example with the inputs x₁ = 1 and x₂ = 0, and the target label is y = 1. Let's perform one iteration of the backpropagation algorithm to update the weights.
We start with forward propagation of the inputs:
The output of the network is 0.6718 while the true label is 1, hence we need to update the weights in order to increase the network's output and bring it closer to the label.
We first compute the delta at the output node. Since this is a binary classification problem, we use the log loss function, and as shown above the delta at the output is o − y.
We now propagate the deltas from the output neuron back to the input layer using the delta rule:
Note how the deltas become smaller and smaller as we move back through the layers, causing the early layers of the network to train very slowly. This phenomenon, known as vanishing gradients, was one of the main reasons why backpropagation failed to train deep networks (until the advent of deep learning).
Finally, we perform one step of gradient descent:
Let's do another forward pass to see if the network's output has become closer to the target:
Indeed, the output has increased from 0.6718 to 0.7043!
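The actual weight values of this example appear in the article's figures, which are not reproduced here. As a stand-in, the following snippet (reusing `forward_pass` and `backprop_step` from the sketch above) runs the same procedure on a 2-2-2-1 sigmoid network with made-up, purely hypothetical weights; the exact numbers differ from the article's, but the output similarly moves toward the label after one update:

```python
import numpy as np

# Hypothetical weights for a 2-2-2-1 network (column 0 is the bias weight);
# these are NOT the values from the article's figures.
weights = [np.array([[0.1, 0.3, -0.2],
                     [0.2, -0.4, 0.5]]),
           np.array([[-0.3, 0.6, 0.1],
                     [0.4, -0.1, 0.2]]),
           np.array([[0.2, 0.5, -0.3]])]

x, y = np.array([1.0, 0.0]), np.array([1.0])
print(forward_pass(x, weights)[-1])       # output before the update
weights = backprop_step(x, y, weights, alpha=0.5)
print(forward_pass(x, weights)[-1])       # output moves closer to y = 1
```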
Final Notes
All images unless otherwise noted are by the author.
Thanks for reading!
References
[1] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Nature 323.6088 (1986): 533–536.