
[Deep Learning Series] Understanding Backpropagation

The Process by Which Neural Networks Adjust Values to Reduce Errors



In this post, continuing from the previous one, I want to look at backpropagation. As explained earlier, this is the algorithm that showed multi-layer networks could actually be trained, which pulled the neural network research community out of its dark age and back into the spotlight.

What is Backpropagation?

Backpropagation is one of the most common algorithms for training artificial neural networks today. In Korean it translates literally to 역전파 (reverse propagation). The algorithm measures how far the output actually computed by the model is from the target value we want, then propagates that error backward through the network, updating the variables held by each node along the way.

Up to this point it is intuitive enough, but I was curious about the principles behind two things:


  1. How exactly are the variables each node holds, such as weights and biases, updated?
  2. In a multi-layer network every node and layer holds different variables - how do we know how much to change each of them?

It turns out both problems can be solved with a rule called the Chain Rule. Let's go through it step by step.

What is Chain Rule?

The Chain Rule is also known as the chain rule of differentiation. It isn't taught in high school but in college math courses, and since I only took discrete mathematics in college, it took me a while to understand. Let's start with the definition.


Given functions $f$ and $g$, if both are differentiable and $F = f(g(x)) = f \circ g$ is a well-defined composite function, then $F$ is differentiable.

In that case, $F'(x) = f'(g(x)) \cdot g'(x)$.

If we let $t = g(x)$ and $y = f(t)$, then $\frac{dy}{dx} = \frac{dy}{dt}\frac{dt}{dx}$ holds.


Looking at the definition you might wonder what it actually means. First, a composite function is just a function that takes the result of another function as its argument. Written in code, it's easy to get a feel for:

function f(g) {
  return g * 3;
}
function g(x) {
  return x + 1;
}

const x = 3;
let F = f(g(x));
// F = 12

So what does it mean for it to be differentiable? If you think of differentiation as finding the slope between $x$ and $x'$, you might wonder why anyone would want the slope of a composite function.

But finding a slope is the same as finding an amount of change. Looking at the code above, you can see that changing the value passed to g changes the final value of F.

F = f(g(4));
// F = 15
F = f(g(2));
// F = 9

In other words, the Chain Rule simply says that if we know how much g changes when x changes, and how much f changes because of g's change, then by chaining those changes together we can work out how much each of f and g contributes to the change in the final value F.
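
As a quick numerical sanity check (a small sketch of my own, not part of the original post), the analytic chain-rule derivative of $F = f(g(x)) = 3(x + 1)$ matches the slope measured directly from the code above with a finite difference:

// f and g reused from the example above
function f(g) {
  return g * 3;
}
function g(x) {
  return x + 1;
}

// Chain rule: F'(x) = f'(g(x)) * g'(x) = 3 * 1 = 3
const analytic = 3 * 1;

// Numerical slope of F = f(g(x)) around x = 3
const h = 1e-6;
const numeric = (f(g(3 + h)) - f(g(3))) / h;

console.log(analytic, numeric); // 3 and ≈ 3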

In the composite function F used above, only a single variable x entered the expression. What happens when there are several variables?


For a two-variable function $z = f(x, y)$ with $x = h(s,t)$ and $y = g(s,t)$,

if $f(x,y)$, $g(s,t)$, and $h(s,t)$ are all differentiable, then the following holds:

$$\begin{aligned} \frac{\partial z}{\partial s} &= \frac{\partial z}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial s} \\ \frac{\partial z}{\partial t} &= \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t} \end{aligned}$$


This means that even without knowing the actual values of $s$ or $t$, we can still express how much the function $z$ changes when they change.

The symbol $\partial$ denotes partial differentiation: you differentiate with respect to one chosen variable and treat all of the others as constants. That's why $t$ doesn't appear as a differentiation variable anywhere in the expression relating $s$ and $z$.
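
To make that concrete, here is a small sketch of my own (the functions $z = xy$, $x = s + t$, $y = st$ are not from the post, just an example): the chain-rule formula for $\partial z/\partial s$ agrees with a slope measured numerically while holding $t$ fixed.

// z = f(x, y) = x * y, with x = h(s, t) = s + t and y = g(s, t) = s * t
const zOf = (s, t) => (s + t) * (s * t);

const s = 2, t = 3;
const x = s + t; // 5
const y = s * t; // 6

// ∂z/∂s = (∂z/∂x)(∂x/∂s) + (∂z/∂y)(∂y/∂s) = y * 1 + x * t
const analytic = y * 1 + x * t; // 21

// Numerical check: nudge s a little while keeping t fixed
const h = 1e-6;
const numeric = (zOf(s + h, t) - zOf(s, t)) / h;

console.log(analytic, numeric); // 21 and ≈ 21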

If everything so far makes sense, let's now look at how backpropagation actually proceeds.

Forward-propagation

Now let's work through the backpropagation calculation directly.

Before that, we must run forward propagation first. Using the initialized weights $w$ and the inputs $x$, we compute the outputs, check whether they are the values we want, and if not, measure how far off they are.

The model I’ll use for this calculation is as below:

[Figure: network model]

This is a 2-layer neural network with 2 inputs, a hidden layer of 2 nodes, and 2 outputs. Now let's assign values to each variable.

[Figure: the model with input, target, and weight values assigned]

The output values I want are 0.2 for $y_1$ and 0.7 for $y_2$. For the inputs I set $x_1 = 0.2$ and $x_2 = 0.5$, and I simply picked each weight $w$ by feel.

For this calculation I'll use the sigmoid function as the activation function and mean squared error as the error function.

Let's first calculate the values the Layer 0 nodes receive. This is usually computed with matrices.

$$\begin{aligned} z_{10} &= \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} w^0_{10} \\ w^0_{20} \end{bmatrix} \\ z_{11} &= \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} w^0_{11} \\ w^0_{21} \end{bmatrix} \end{aligned}$$

Unfolding that matrix multiplication gives the following, which is ultimately just a sum of $wx$ terms.

$$\begin{aligned} z_{10} &= x_1w^0_{10} + x_2w^0_{20} = (0.2\times0.1) + (0.5\times0.3) = 0.02 + 0.15 = 0.17 \\ z_{11} &= x_1w^0_{11} + x_2w^0_{21} = (0.2\times0.2) + (0.5\times0.1) = 0.04 + 0.05 = 0.09 \end{aligned}$$
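
If you'd rather not multiply these out by hand, a tiny helper does the same weighted sums (a minimal sketch; the `dot` helper is my own and is not part of the post's later code):

// Weighted sum of inputs and weights: Σ x_i * w_i
const dot = (xs, ws) => xs.reduce((sum, x, i) => sum + x * ws[i], 0);

const z10 = dot([0.2, 0.5], [0.1, 0.3]); // ≈ 0.17
const z11 = dot([0.2, 0.5], [0.2, 0.1]); // ≈ 0.09
console.log(z10, z11);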

Now that we have $z_{10}$ and $z_{11}$, let's compute $a_{10}$ and $a_{11}$ using the activation function. The sigmoid function I'll use is defined as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Since computing this by hand every time is tedious, I wrote a small JavaScript function and used that instead.

function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

$$\begin{aligned} a_{10} &= \sigma(z_{10}) \approx 0.54 \\ a_{11} &= \sigma(z_{11}) \approx 0.52 \end{aligned}$$

Repeating the same process for the next layer gives the following values:

$$\begin{aligned} z_{10} &= 0.17, \quad a_{10} = 0.54 \\ z_{11} &= 0.09, \quad a_{11} = 0.52 \\ z_{20} &= 0.27, \quad a_{20} = 0.57 \\ z_{21} &= 0.43, \quad a_{21} = 0.61 \end{aligned}$$
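
As a quick check, the same numbers can be reproduced with the sigmoid function above (a small sketch of my own; values are rounded to two decimals):

function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

const a10 = sigmoid(0.17); // ≈ 0.54
const a11 = sigmoid(0.09); // ≈ 0.52

// Second layer: weighted sums of the hidden activations
const z20 = a10 * 0.4 + a11 * 0.1; // ≈ 0.27
const z21 = a10 * 0.5 + a11 * 0.3; // ≈ 0.43

console.log(sigmoid(z20).toFixed(2), sigmoid(z21).toFixed(2)); // 0.57 0.61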

Since $y_1$ and $y_2$ are just $a_{20}$ and $a_{21}$, we now have the final outputs. But the $y_1$ and $y_2$ we originally wanted were 0.2 and 0.7, while the outputs we got are 0.57 and 0.61 - there is quite a gap.

Now it's time to compute the error $E$ with the mean squared error function. Calling the values we hope to obtain $t$ and the values actually produced $y$, the error $E$ is:

$$E = \frac{1}{2}\sum_i(t_i - y_i)^2$$

To put it plainly for math dropouts like me: for each output of the final layer, take the difference between the predicted $y$ and the labeled target, square it, halve it, and add those values up.

Ultimately, training an artificial neural network amounts to driving this error $E$ toward 0. The $E$ obtained here is the value we will propagate backward.

Since calculating this by hand every time is also annoying, let’s just make one function.

// E = 1/2 * Σ (t_i - y_i)^2
function MSE(targets, values) {
  if (!Array.isArray(values)) {
    return false;
  }

  let result = 0;
  targets.forEach((target, i) => {
    result += 0.5 * (target - values[i]) ** 2;
  });

  return result;
}

MSE([0.2, 0.7], [0.57, 0.61]); // 0.072

Now let's run backpropagation using the error $E$ we just found.

Backpropagation

Here is a diagram of the values found through forward propagation:

[Figure: forward-propagation results on the network diagram]

Of these, I want to update $w^1_{10}$, currently set to 0.4. To do that, I need to find how much $w^1_{10}$ affected the total error $E$ - in other words, its contribution. This is where the Chain Rule explained above comes in.

Unfolding $w^1_{10}$'s contribution to $E$ as an expression gives:

$$\frac{\partial E}{\partial w^1_{10}} = \frac{\partial E}{\partial a_{20}} \frac{\partial a_{20}}{\partial z_{20}} \frac{\partial z_{20}}{\partial w^1_{10}}$$

Let's start with $\frac{\partial E}{\partial a_{20}}$. The error $E$ we found earlier was:

$$E = \frac{1}{2}\left((t_1 - a_{20})^2 + (t_2 - a_{21})^2\right)$$

Here I substituted $a_{20} = y_1$ and $a_{21} = y_2$. Since $\frac{\partial E}{\partial a_{20}}$ is a partial derivative, the term containing $a_{21}$, which has nothing to do with the variable we are differentiating with respect to, simply becomes 0.

$$\frac{\partial E}{\partial a_{20}} = -(t_1 - a_{20}) + 0 = -(0.2 - 0.57) = 0.37$$

This result means that $a_{20}$, i.e. $y_1$, contributed 0.37 to the total error $E$. Let's keep going in the same way, using the fact that the derivative of the sigmoid is $\sigma(x)(1 - \sigma(x))$.

$$\begin{aligned} \frac{\partial a_{20}}{\partial z_{20}} &= a_{20}(1 - a_{20}) = 0.57 \times (1 - 0.57) \approx 0.245 \\ \frac{\partial z_{20}}{\partial w^1_{10}} &= a_{10} + 0 = 0.54 \\ \frac{\partial E}{\partial w^1_{10}} &= 0.37 \times 0.245 \times 0.54 \approx 0.049 \end{aligned}$$

So the contribution of $w^1_{10}$ to $E$ is 0.049. Plugging this into the update rule, we can now update $w^1_{10}$.

At this point we need a learning rate, a value that decides how large a step to take, i.e. how fast to learn. It is just a constant that a human chooses, usually set below 0.1, but here I set it to 0.3.

$$w^{1+}_{10} = w^1_{10} - L\,\frac{\partial E}{\partial w^1_{10}} = 0.4 - (0.3 \times 0.049) = 0.3853$$
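
The same update can be reproduced in a few lines of JavaScript (a sketch of my own using the rounded values from the text, so the last digits may differ slightly):

const t1 = 0.2;   // target
const a20 = 0.57; // output of the node that w1_10 feeds into
const a10 = 0.54; // activation flowing through w1_10
const L = 0.3;    // learning rate

const dE_da20 = -(t1 - a20);                // 0.37
const da20_dz20 = a20 * (1 - a20);          // ≈ 0.245
const dz20_dw = a10;                        // 0.54
const grad = dE_da20 * da20_dz20 * dz20_dw; // ≈ 0.049

console.log(0.4 - L * grad); // ≈ 0.3853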

And so we get the new value of $w^1_{10}$: 0.3853. Let's keep updating the other weights in the same way. This time I'll update $w^0_{10}$ in Layer 0, which sits one layer deeper than Layer 1.

[Figure: network diagram showing the paths from $w^0_{10}$ to the total error]

As the diagram shows, $w^0_{10}$ affects more values than $w^1_{10}$ does. The degree to which $w^0_{10}$ contributed to the total error $E_t$ can be written as:

$$\frac{\partial E_t}{\partial w^0_{10}} = \left(\frac{\partial E_1}{\partial a_{10}} + \frac{\partial E_2}{\partial a_{10}}\right) \frac{\partial a_{10}}{\partial z_{10}} \frac{\partial z_{10}}{\partial w^0_{10}}$$

Let's first find $\frac{\partial E_1}{\partial a_{10}}$.

$$\begin{aligned} \frac{\partial E_1}{\partial a_{10}} &= \frac{\partial E_1}{\partial a_{20}} \frac{\partial a_{20}}{\partial z_{20}} \frac{\partial z_{20}}{\partial a_{10}} \\ &= -(t_1 - a_{20}) \times a_{20}(1 - a_{20}) \times w^1_{10} \\ &= -(0.2 - 0.57) \times 0.57 \times (1 - 0.57) \times 0.4 \\ &= 0.03627 \end{aligned}$$

Similarly, let's find $\frac{\partial E_2}{\partial a_{10}}$.

$$\begin{aligned} \frac{\partial E_2}{\partial a_{10}} &= \frac{\partial E_2}{\partial a_{21}} \frac{\partial a_{21}}{\partial z_{21}} \frac{\partial z_{21}}{\partial a_{10}} \\ &= -(t_2 - a_{21}) \times a_{21}(1 - a_{21}) \times w^1_{11} \\ &= -(0.7 - 0.61) \times 0.61 \times (1 - 0.61) \times 0.5 \\ &= -0.0107 \end{aligned}$$

Now that we have both $\frac{\partial E_1}{\partial a_{10}}$ and $\frac{\partial E_2}{\partial a_{10}}$, we can find $\frac{\partial E_t}{\partial w^0_{10}}$. Note that since $z_{10} = x_1w^0_{10} + x_2w^0_{20}$, the last factor is $\frac{\partial z_{10}}{\partial w^0_{10}} = x_1 = 0.2$.

$$\begin{aligned} \frac{\partial E_t}{\partial w^0_{10}} &= \left(\frac{\partial E_1}{\partial a_{10}} + \frac{\partial E_2}{\partial a_{10}}\right) \frac{\partial a_{10}}{\partial z_{10}} \frac{\partial z_{10}}{\partial w^0_{10}} \\ &= (0.03627 + (-0.0107)) \times 0.2484 \times 0.2 \\ &\approx 0.0013 \end{aligned}$$

So $w^0_{10}$ contributes about 0.0013 to the total error $E_t$. Now let's update $w^0_{10}$ with this value, using the same learning rate of 0.3 as before.

$$w^{0+}_{10} = w^0_{10} - L\,\frac{\partial E_t}{\partial w^0_{10}} = 0.1 - (0.3 \times 0.0013) \approx 0.0996$$
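
Again, a short sketch of my own, using the rounded intermediate values from above, reproduces these numbers:

const e1 = 0.03627;                  // ∂E1/∂a10
const e2 = -0.0107;                  // ∂E2/∂a10
const da10_dz10 = 0.54 * (1 - 0.54); // ≈ 0.2484
const x1 = 0.2;                      // ∂z10/∂w0_10

const grad = (e1 + e2) * da10_dz10 * x1; // ≈ 0.0013
console.log(0.1 - 0.3 * grad);           // ≈ 0.0996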

Coding

Since I'm definitely not going to do this by hand for all eight weights, I wrote the formulas explained above in JavaScript.

function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// E = 1/2 * Σ (t_i - y_i)^2
function MSE(targets, values) {
  if (!Array.isArray(values)) {
    return false;
  }

  let result = 0;
  targets.forEach((target, i) => {
    result += 0.5 * (target - values[i]) ** 2;
  });

  return result;
}

// Initialize inputs
const x1 = 0.2;
const x2 = 0.5;

// Initialize target values
const t1 = 0.2;
const t2 = 0.7;

// Initialize Weights
const w0 = [
  [0.1, 0.2],
  [0.3, 0.1],
];
const w1 = [
  [0.4, 0.5],
  [0.1, 0.3],
];
const learningRate = 0.3;
const limit = 1000; // Number of training iterations

// Update a weight in the second (output) layer.
// Gradient: ∂E/∂w = -(t - y) * y * (1 - y) * prevY
function updateSecondLayerWeight(targetY, y, prevY, weight) {
  const v1 = -(targetY - y);   // ∂E/∂a (the other output's term differentiates to 0)
  const v2 = y * (1 - y);      // ∂a/∂z (sigmoid derivative)
  const def = v1 * v2 * prevY; // ∂z/∂w = activation feeding this weight
  return weight - learningRate * def;
}

// Update a weight in the first (hidden) layer.
// x is the input feeding this weight, a is the hidden activation it produces.
function updateFirstLayerWeight(t1, t2, y1, y2, w1, w2, a, x, weight) {
  const e1 = -(t1 - y1) * y1 * (1 - y1) * w1; // ∂E1/∂a
  const e2 = -(t2 - y2) * y2 * (1 - y2) * w2; // ∂E2/∂a
  const v1 = a * (1 - a);                     // ∂a/∂z (sigmoid derivative)
  const v2 = x;                               // ∂z/∂w = input feeding this weight
  const def = (e1 + e2) * v1 * v2;

  return weight - learningRate * def;
}

// Start training
for (let i = 0; i < limit; i++) {
  let z10 = x1 * w0[0][0] + x2 * w0[1][0];
  let a10 = sigmoid(z10);
  let z11 = x1 * w0[0][1] + x2 * w0[1][1];
  let a11 = sigmoid(z11);

  let z20 = a10 * w1[0][0] + a11 * w1[1][0];
  let a20 = sigmoid(z20);
  let z21 = a10 * w1[0][1] + a11 * w1[1][1];
  let a21 = sigmoid(z21);

  let e_t = MSE([t1, t2], [a20, a21]);

  console.log(`[${i}] y1 = ${a20}, y2 = ${a21}, E = ${e_t}`);

  // Update with new Weights using calculated contributions
  const newW0 = [
    [
      updateFirstLayerWeight(t1, t2, a20, a21, w1[0][0], w1[0][1], a10, x1, w0[0][0]),
      updateFirstLayerWeight(t1, t2, a20, a21, w1[1][0], w1[1][1], a11, x1, w0[0][1]),
    ],
    [
      updateFirstLayerWeight(t1, t2, a20, a21, w1[0][0], w1[0][1], a10, x2, w0[1][0]),
      updateFirstLayerWeight(t1, t2, a20, a21, w1[1][0], w1[1][1], a11, x2, w0[1][1]),
    ],
  ];
  const newW1 = [
    [updateSecondLayerWeight(t1, a20, a10, w1[0][0]), updateSecondLayerWeight(t2, a21, a10, w1[0][1])],
    [updateSecondLayerWeight(t1, a20, a11, w1[1][0]), updateSecondLayerWeight(t2, a21, a11, w1[1][1])],
  ];

  // Reflect updated Weights
  newW0.forEach((v, i) => {
    v.forEach((vv, ii) => (w0[i][ii] = vv));
  });
  newW1.forEach((v, i) => {
    v.forEach((vv, ii) => (w1[i][ii] = vv));
  });
}

console.log(`t1 = ${t1}, t2 = ${t2}`);

Looking at the results, you can see the network's outputs y1 and y2 gradually converging toward t1 and t2. Since I can't show all 1000 iterations, I'm attaching the output at the start, middle, and end of training.

[Figures: training log at the start, middle, and end of training]

In the next post I'd like to build a more structured network. That wraps up this post on backpropagation.
