WEBVTT
00:20.240 --> 00:27.240
This is the second lecture of module-1 on
intelligent control. The outline of this lecture
00:29.650 --> 00:36.650
is like this: revision of the previous lecture,
multilayer feed-forward network, back propagation
00:38.550 --> 00:41.960
learning algorithm, and learning the XOR map.
00:41.960 --> 00:47.809
These are the topics we will be discussing
today. Here is a summary of what we discussed
00:47.809 --> 00:53.920
in the last class. What you are seeing is
a single neuron. We have multiple inputs that
00:53.920 --> 00:59.489
are multiplied with the weight associated
with this connection. They are summed up in
00:59.489 --> 01:06.180
this summing junction and then you have an
activation function. In linear neural network,
01:06.180 --> 01:13.180
we consider this to be simply a constant
gain; that is, the neuron is linear.
01:13.810 --> 01:20.490
The batch update law that we stated is that the
new weight vector is the old weight vector plus
01:20.490 --> 01:27.490
eta times the sum of the error terms due
to all the patterns, whereas the instantaneous
01:29.420 --> 01:36.420
update is done simply with the back-propagated
term due to a single input pattern, where
01:37.320 --> 01:41.020
delta is the back-propagated error.
01:41.020 --> 01:48.020
We will take another example today. In this example
we have two poles inside the unit
01:50.429 --> 01:57.429
circle, both real: 0.9 and 0.8. This is my system
dynamics, and my model is this:
02:01.249 --> 02:08.249
y_p(k) = w1 y(k-1) +
w2 y(k-2) + w3 u(k-2).
02:13.319 --> 02:20.319
I take the same single neuron with three inputs,
y(k-1), y(k-2) and u(k-2). When
02:23.000 --> 02:30.000
I do that, I finally get these values:
w1 converges to 1.7, w2 converges
02:33.330 --> 02:40.330
to minus 0.72 and w3 converges to 0.02. The
error is minimum at w1 = 1.7, w2 = minus
02:44.400 --> 02:51.400
0.72, and w3 = 0.02. This error surface
looks like this and finally, here is your
02:53.410 --> 02:54.670
global minimum.
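The identification just described can be sketched in code; a minimal Python sketch of training the single linear neuron on this system (the normalized step size and the white-noise excitation are illustrative choices, not from the lecture):

```python
import random

# True second-order system: poles at 0.9 and 0.8, so
# y(k) = 1.7*y(k-1) - 0.72*y(k-2) + 0.02*u(k-2)
W_TRUE = (1.7, -0.72, 0.02)

def identify(n_steps=20000, eta=0.5, seed=0):
    """Single linear neuron trained with an instantaneous (LMS-style) update.
    The step is normalized by the regressor energy for robustness; the
    lecture itself uses the plain delta rule."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]                     # estimates of w1, w2, w3
    y1 = y2 = u1 = u2 = 0.0                 # y(k-1), y(k-2), u(k-1), u(k-2)
    for _ in range(n_steps):
        u = rng.uniform(-1.0, 1.0)          # persistently exciting input
        y = W_TRUE[0]*y1 + W_TRUE[1]*y2 + W_TRUE[2]*u2
        x = (y1, y2, u2)                    # the three inputs of the neuron
        e = y - sum(wi*xi for wi, xi in zip(w, x))
        norm = sum(xi*xi for xi in x) + 1e-8
        w = [wi + eta*e*xi/norm for wi, xi in zip(w, x)]
        y2, y1 = y1, y                      # shift the delay line
        u2, u1 = u1, u
    return w
```

Run long enough, the weight vector approaches (1.7, -0.72, 0.02), the global minimum of the error surface shown on the slide.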
02:54.670 --> 03:01.670
We talked about a single layer linear feed
forward network. Now, let us talk about single
03:02.470 --> 03:09.470
layer nonlinear feed forward networks. In
this, the activation function becomes non-linear.
03:10.860 --> 03:17.290
So, f is a non-linear function. The question
arises as to what kind of nonlinear function
03:17.290 --> 03:21.960
we can consider as an activation function.
03:21.960 --> 03:28.960
Here is a non-linear activation function.
The horizontal axis is x, the vertical axis is f(x), and the function
03:32.040 --> 03:39.040
is f(x) = 1 / (1 + e^(-alpha x)). By changing the alpha
value, we can change the transition from the 0
03:40.300 --> 03:47.300
saturation level to the 1 saturation level. The function varies
03:53.760 --> 04:00.760
from 0 to 1, and alpha determines whether the rise from the 0 saturation
04:07.610 --> 04:14.610
level to the 1 saturation level is sharp or slow. This is a non-linear
function. Why did we select this activation
04:17.530 --> 04:18.289
function?
04:18.289 --> 04:25.289
If we look at it, you may ask what is the
speciality of this activation function? It
04:26.720 --> 04:33.720
is very clear from the previous graph of this
function that f(x) is always limited,
04:38.599 --> 04:45.599
bounded between 0 and 1: 0 < f(x)
< 1. But the interesting
04:47.960 --> 04:51.930
part of this function is that if I differentiate
it, I find that df(x)/dx is
04:51.930 --> 04:58.930
equal to alpha f(x) (1 - f(x)).
We can see that the derivative of such an activation
05:03.340 --> 05:07.810
function has a very simple form.
05:07.810 --> 05:14.810
We had an activation function of this form:
f(x) = 1 / (1 + e^(-alpha x)).
05:16.289 --> 05:23.289
I said that the differentiation
of this activation function is very simple
05:26.279 --> 05:33.279
to compute. If you do that, you see that I
get 1 over the square of (1 + e^(-alpha x)),
05:46.930 --> 05:53.930
and then you differentiate the inner term. Finally,
this is f'(x) = [alpha e^(-alpha
06:00.090 --> 06:07.090
x) / (1 + e^(-alpha x))] into [1 / (1
+ e^(-alpha x))].
06:12.490 --> 06:19.490
You can verify: the second factor is f(x) and
the first factor is alpha (1 - f(x)).
06:25.939 --> 06:32.939
So f'(x) becomes f(x) (1 - f(x))
with an alpha in front. Although it is a nonlinear
06:38.379 --> 06:45.379
function, its computation is very simple because
it is in terms of f(x). That is the specialty
06:46.990 --> 06:50.340
of this activation function.
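The identity f'(x) = alpha f(x) (1 - f(x)) derived above can be checked numerically; a minimal Python sketch (function names and test points are illustrative):

```python
import math

def sigmoid(x, alpha=1.0):
    """Unipolar sigmoid: f(x) = 1 / (1 + exp(-alpha*x))."""
    return 1.0 / (1.0 + math.exp(-alpha * x))

def sigmoid_derivative(x, alpha=1.0):
    """Closed form in terms of f itself: f'(x) = alpha * f(x) * (1 - f(x))."""
    f = sigmoid(x, alpha)
    return alpha * f * (1.0 - f)

# Compare with a central-difference approximation of df/dx
for x in (-2.0, 0.0, 1.5):
    for alpha in (0.5, 1.0, 3.0):
        h = 1e-6
        numeric = (sigmoid(x + h, alpha) - sigmoid(x - h, alpha)) / (2 * h)
        assert abs(sigmoid_derivative(x, alpha) - numeric) < 1e-5
```

The derivative never requires re-evaluating the exponential once f(x) is known, which is exactly the computational advantage the lecture points out.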
06:50.340 --> 06:54.539
We derived this for f(x) = 1 / (1
+ e^(-alpha x)), which
06:54.539 --> 07:01.539
we normally call the sigmoid function. For
the sigmoid function, we showed that f'(x)
07:02.430 --> 07:09.430
= alpha f(x) (1 - f(x)).
Because of this simple computation, this
07:10.219 --> 07:17.039
is one of the reasons also for making a choice
of this activation function. Here is another
07:17.039 --> 07:23.860
example of an activation function. The previous
one was unipolar because the activation function
07:23.860 --> 07:30.860
takes values from 0 to 1 as you vary the
input x from minus infinity to plus infinity.
07:34.889 --> 07:41.889
But this new activation function, when you
vary x from minus infinity to plus infinity,
07:42.379 --> 07:49.379
takes values from minus 1 to 1, and it is the
hyperbolic tangent function: the hyperbolic tangent
07:53.360 --> 08:00.360
function f(x) = (e^(alpha x) - e^(-alpha x))
/ (e^(alpha x)
08:02.219 --> 08:09.219
+ e^(-alpha x)). Again, alpha is there just to control the
08:11.210 --> 08:18.210
nonlinearity in the function.
08:19.479 --> 08:26.479
The activation function has a limit from minus
1 to 1. So it is a bipolar activation function
08:27.509 --> 08:34.229
and, as we derived earlier, you can derive
and verify that the
08:34.229 --> 08:41.219
derivative of this activation function
with respect to x is f'(x) = alpha (1 + f(x))
08:41.219 --> 08:48.219
(1 - f(x)). Again, it is a
very simple computational form.
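The bipolar case can be checked the same way; a small Python sketch (the identity follows because the derivative of tanh(alpha x) is alpha (1 - tanh^2) = alpha (1 + f)(1 - f); names are illustrative):

```python
import math

def bipolar(x, alpha=1.0):
    """Bipolar activation: (e^(ax) - e^(-ax)) / (e^(ax) + e^(-ax)) = tanh(alpha*x)."""
    return math.tanh(alpha * x)

def bipolar_derivative(x, alpha=1.0):
    """f'(x) = alpha * (1 + f(x)) * (1 - f(x))."""
    f = bipolar(x, alpha)
    return alpha * (1.0 + f) * (1.0 - f)

# Compare with a central-difference approximation of df/dx
h = 1e-6
for x in (-1.0, 0.0, 2.0):
    numeric = (bipolar(x + h) - bipolar(x - h)) / (2 * h)
    assert abs(bipolar_derivative(x) - numeric) < 1e-5
```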
08:50.240 --> 08:57.240
For this, if you use the sigmoid activation
function, that is, the previous one,
09:00.350 --> 09:07.350
1 / (1 + e^(-alpha x)), then
we can go back to the basic principle of the gradient
09:11.230 --> 09:18.230
descent rule. If we apply the gradient descent
rule to this nonlinear single layer feed
09:23.620 --> 09:30.620
forward network, you will get a batch
update. This
09:31.500 --> 09:38.500
one is the batch update and this is the instantaneous
update, where delta_p = y_p (1 - y_p)
09:43.680 --> 09:50.680
(y_d,p - y_p). These are all specific to a
particular pattern; p refers to a specific pattern.
09:52.839 --> 09:59.839
The structure of the update law remains the
same as what we had earlier. The
10:02.980 --> 10:09.980
only difference is that back error propagated
delta has become different. I will go back
10:10.149 --> 10:16.959
now to make a little comparison between linear
neural networks and non-linear neural networks
10:16.959 --> 10:22.880
in terms of the similarity and differences.
10:22.880 --> 10:29.880
I have the linear neural network on one side,
here, and here the non-linear network. In a linear
10:42.990 --> 10:49.990
network, I have x1, x2, ..., xn and then you
have the computational unit, a linear computational
10:56.199 --> 11:03.199
unit, with weights w1, w2, ..., wn and output y. In the linear case,
y = sigma_i wi xi. In the case of
11:10.319 --> 11:17.319
the non-linear unit, it is the same thing, but with a non-linear
activation function. So, y becomes f of the sum;
11:17.790 --> 11:24.790
it is the same as y = 1 / (1 + e^(-sigma_i
wi xi)). This is a non-linear network.
11:30.600 --> 11:37.600
What do we do? We apply the gradient descent
rule: wi(t+1) = wi(t) minus
11:41.740 --> 11:48.740
eta * (del E / del wi). This is the gradient
descent rule. This is applicable for both
11:52.759 --> 11:57.339
linear neural network as well as a non-linear
neural network.
11:57.339 --> 12:04.339
We have already shown, for the linear neural network,
that this becomes wi(t) + eta * delta * xi,
12:08.690 --> 12:15.690
where delta is the back-propagated error, y desired
minus y.
We are only considering instantaneous update
12:25.029 --> 12:31.699
because mostly in control system we will deal
with the real time implementation. Real time
12:31.699 --> 12:37.259
implementation means instantaneous update,
we cannot do a batch update. That is why we
12:37.259 --> 12:44.259
will only consider instantaneous update and
for this delta is equal to y d minus y, when
12:48.199 --> 12:55.199
it is linear and you can compute the same
thing here for non-linear network and you
12:58.180 --> 13:05.180
get delta = y (1 - y) (y_d - y). Here
we saw that delta for linear neural network
13:13.790 --> 13:20.790
is y_d - y, and we computed the same delta
for the non-linear network: y (1 - y) (y_d -
13:22.459 --> 13:29.459
y). The y (1 - y) is the extra term that comes because
of differentiation of f. You can now appreciate
13:30.970 --> 13:37.970
why f has been selected as sigmoid activation
function. This is one of the advantages of
13:40.569 --> 13:42.019
computation.
13:42.019 --> 13:49.019
Let me take you through the total process
of the update of weights in simple neural
13:51.629 --> 13:58.629
network; linear and non-linear. Here it is
your input. Take the input pattern. Allow
13:59.100 --> 14:06.100
that pattern to be processed by the network.
The network computes y. For each input pattern,
14:07.509 --> 14:14.509
there is a given y d, desired output. Compute
delta. This delta is y d minus y for linear
14:18.319 --> 14:25.319
network and for the non-linear network, it is
y (1 - y) (y_d - y). So, that error has
14:25.470 --> 14:32.470
been transferred and then you look at here,
in this side. This is your back error propagated,
14:32.709 --> 14:39.709
delta. This is your input pattern. Each weight
if you look at is simply being updated based
14:40.899 --> 14:47.899
on its input, x1 and what is the error being
back propagated. You can look at here the
14:51.000 --> 14:58.000
update algorithm is wi(t) + eta * delta * xi.
For every weight we can easily check a very
14:59.279 --> 15:06.279
simple rule: the updated weight is the original
weight plus eta, the learning rate, times the
15:09.569 --> 15:12.759
input and the back-propagated error.
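The whole process just described, for both the linear and the non-linear single-layer case, fits in a few lines; a sketch (the function name and the learning rate are illustrative):

```python
import math

def train_step(w, x, y_d, eta=0.5, linear=False):
    """One instantaneous update of a single neuron: wi <- wi + eta*delta*xi."""
    s = sum(wi * xi for wi, xi in zip(w, x))   # summing junction
    if linear:
        y = s
        delta = y_d - y                        # delta for the linear neuron
    else:
        y = 1.0 / (1.0 + math.exp(-s))         # sigmoid activation
        delta = y * (1.0 - y) * (y_d - y)      # extra y(1-y) from f'
    w_new = [wi + eta * delta * xi for wi, xi in zip(w, x)]
    return w_new, y
```

Each weight is updated from its own input and the one back-propagated delta, exactly as in the figure.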
15:12.759 --> 15:19.759
Although a single layer linear feed forward
network can learn any linear map, the same
15:21.220 --> 15:28.220
is not true of a single layer non-linear
feed forward network for nonlinear maps. In the last class and
15:28.870 --> 15:35.870
in today’s class we showed that any linear
dynamical system can be mapped or can be learned
15:39.519 --> 15:46.519
or can be identified using a single layer
linear neural network. But this is not the
15:47.129 --> 15:54.129
case with non-linear dynamical systems. We
cannot identify or we cannot approximate any
15:56.560 --> 16:03.560
non-linear function using a single layer neural
network. So that is the bad part, bad news.
16:05.639 --> 16:12.639
In fact, it turns out that a single layer nonlinear
feed forward network can only classify those
16:14.550 --> 16:21.410
functions which are linearly separable.
16:21.410 --> 16:28.410
We saw the difference between weight updates
in case of linear neural network and non-linear
16:32.689 --> 16:39.689
neural network when there is only single layer;
simple neural networks. We also showed that
16:43.850 --> 16:50.850
any single layer linear neural network can
approximate any linear dynamical system. If
17:04.010 --> 17:11.010
we could have mapped any nonlinear dynamical
system using a simple non-linear neural network,
17:19.260 --> 17:26.260
probably we would have been happy, but this
is not the case. It turns out that a single
17:35.789 --> 17:42.260
layer nonlinear feed forward
network can only map those nonlinear functions
17:42.260 --> 17:49.260
that are linearly separable. Those functions
which are not linearly separable cannot be
17:51.710 --> 17:58.710
mapped or cannot be learnt using a single
layer non-linear feed forward network. This
18:02.370 --> 18:09.370
is the bad news. So, what is this linear separability?
Let us consider a simple static function.
18:11.620 --> 18:18.620
All of you know an OR function, OR network.
Any OR gate has two inputs, x1 and x2, output
18:21.059 --> 18:28.059
is y_d. This is the truth table: for input
minus 1, minus 1 the output is minus 1, and in all
18:30.539 --> 18:35.529
other cases, the output is 1. That is the
OR network.
18:35.529 --> 18:42.529
Linearly separable - we want to address this
one. We have an OR function. All of you are
18:49.400 --> 18:55.190
already aware of the OR function. The truth
table is this: for minus 1, minus 1, the output becomes
18:55.190 --> 19:02.190
minus 1. In all other cases, it is plus
1. This is your truth table. Given specific
19:09.000 --> 19:16.000
input pattern, output is either minus 1 or
plus 1. Now, let us look at graphical map
19:16.650 --> 19:23.650
of this function. Minus 1, minus 1 is here;
this, say, is 1; and here are 1, minus 1 and minus
19:33.230 --> 19:40.230
1, 1. This is your point where the output is minus 1,
and in all other cases, like minus 1, plus 1, you
19:47.700 --> 19:54.700
have a plus; for plus 1, plus 1, you have a plus;
you have a plus. You see that if I draw a line
20:11.140 --> 20:18.140
here, a straight line, it is able to separate
one specific class from another class. For this
20:25.220 --> 20:32.220
class of input pattern output is 1 and for
this class of input pattern, output is minus
20:32.309 --> 20:35.720
1.
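Because OR is linearly separable, a single sigmoid neuron with a bias input does learn it with the instantaneous update; a sketch using 0/1 encoding (the board uses -1/+1; the epoch count and learning rate are illustrative choices, not from the lecture):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def learn_or(epochs=10000, eta=1.0):
    """Delta-rule training of one sigmoid neuron (plus bias) on the OR map."""
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
    w = [0.0, 0.0, 0.0]                       # w1, w2, bias weight
    for _ in range(epochs):
        for (x1, x2), y_d in data:
            x = (x1, x2, 1.0)                 # constant 1 feeds the bias
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            delta = y * (1 - y) * (y_d - y)   # back-propagated error
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x1, x2):
    return sigmoid(w[0] * x1 + w[1] * x2 + w[2]) > 0.5
```

After training, thresholding the output at 0.5 reproduces the OR truth table; the learned weights describe exactly the separating line drawn on the board.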
20:35.720 --> 20:42.720
This particular function that is OR function
is known as linearly separable function. This
20:46.399 --> 20:53.399
can be easily mapped using a non-linear single
layer neural network; very simple. To conclude,
20:58.890 --> 21:05.890
a linearly separable function - a function
is linearly separable if there exists a hyperplane
21:11.169 --> 21:18.169
that distinguishes between one class and another
class. That is linear separability, but there
21:20.860 --> 21:27.860
are many cases, many functions which are not
linearly separable. A very simple example
21:27.919 --> 21:34.919
is XOR function. Let us look at another example
XOR map.
21:38.460 --> 21:45.460
In this function, we have two inputs and one
output and the truth table looks like this.
22:04.750 --> 22:11.750
This is your XOR map. For these two cases, you have
plus 1 output, and in the other two cases you
22:14.580 --> 22:21.580
have minus 1 output. Let us look at the graphical
map. This is your x1,
22:24.980 --> 22:31.980
this is your x2. So, for minus 1 and minus 1,
you have minus, and when you have plus 1 and
22:40.070 --> 22:47.070
plus 1, you again have minus. In the other two
cases you have plus; that is, for plus 1, minus
22:50.799 --> 22:57.799
1 and for minus 1, plus 1 you
also have plus.
22:58.630 --> 23:05.630
In this case, you cannot find a single line
that will separate this class from this class.
23:06.980 --> 23:13.980
This class cannot be separated from this class
using a single line. Here is a function which
23:17.020 --> 23:24.020
is not linearly separable, but
it is non-linearly separable. I can create
23:27.909 --> 23:34.909
a non-linear decision boundary that separates
this class from these two classes. This class,
23:35.350 --> 23:42.350
the minus class is separated from plus class
using a non-linear decision boundary. This
23:47.320 --> 23:54.320
is what we call functions which are not linearly
separable. Let us summarize what we discussed
23:57.860 --> 24:04.860
now. The limitation of a single layer non-linear
feed forward network is that it
24:06.100 --> 24:13.100
cannot even approximate
an XOR function; that is, it can approximate
24:14.059 --> 24:17.159
only those functions which are linearly separable.
24:17.159 --> 24:24.159
There are many complex functions. They are
not linearly separable. You can refer to Minsky
24:26.659 --> 24:33.659
and Papert’s book called “Perceptrons”
to know more about this linear separability
24:34.309 --> 24:41.309
and non-linear separability. This book was
published by MIT Press, Cambridge, in 1969. However,
24:44.409 --> 24:51.409
later researchers found that if we increase
the complexity of the network in terms of
24:52.269 --> 24:57.620
the number of layers, that is, instead of
a single layer network we consider a 2 layer, 3 layer,
24:57.620 --> 25:04.620
or 4 layer, that is, a multi-layer network, then
such a feed forward network can always approximate
25:10.390 --> 25:12.640
any non-linear function.
25:12.640 --> 25:19.640
One of the very key research contributions
is by Hornik, Stinchcombe and White, published
25:22.010 --> 25:29.010
in 1989 in Neural Networks: “Multilayer Feedforward
Networks are Universal Approximators”. The
25:31.440 --> 25:38.440
problem that we faced from single layer neural
network for non-linear approximation can be
25:40.140 --> 25:47.140
mitigated or eliminated by increasing the
layers to more than 1. It turns out that even
25:51.570 --> 25:58.570
a 2 layer network with 1 hidden layer, consisting
of sufficiently many neurons, can approximate any
26:03.169 --> 26:05.730
continuous non-linear function.
26:05.730 --> 26:12.730
When this neural network becomes complex,
can we write the learning algorithm
26:15.850 --> 26:22.850
the way we wrote it for the single layer network?
The answer is yes, little adjustments have
26:29.559 --> 26:36.559
to be done, but implementation-wise it is
not difficult. The algorithm that was derived
26:39.860 --> 26:46.860
using gradient descent for nonlinear neural
networks with nonlinear activation function
26:49.899 --> 26:56.899
is popularly known as back propagation learning
algorithm, although the learning algorithm
27:01.529 --> 27:06.269
still is derived using gradient descent rule.
27:06.269 --> 27:13.269
We will make this point very clear why this
learning algorithm is known as back propagation.
27:15.679 --> 27:22.679
We gave some hints in the single layer network
that computed error output is back propagated
27:23.640 --> 27:30.640
and based on that, the weights are being updated.
Based on that information, the weight update
27:32.039 --> 27:39.039
formula takes a very simple shape, simple
structure. Again, for a multi-layer network,
27:44.799 --> 27:51.799
is it very difficult to derive the learning
algorithm? We
will derive today the learning algorithm for
27:55.980 --> 28:02.980
a two layer network and in the next class
we will show, for any number of layers, the
28:07.169 --> 28:14.169
hidden layers may be any number, say
capital N, but the learning algorithm
28:17.899 --> 28:24.899
derivation is very simple. But all these learning
algorithms that we will derive in this course
28:25.169 --> 28:31.649
will be instantaneous update rule. The reason
being, again as I said, control systems are
28:31.649 --> 28:38.649
all real; they require real time implementation
and we have to be mindful of that point
28:41.510 --> 28:48.510
of view. So, let us summarize what we are
now going to do. This is the focus of today’s
28:50.110 --> 28:50.880
class.
28:50.880 --> 28:57.880
Multilayer feed forward network has more hidden
layers and again, when I say feed forward
29:01.409 --> 29:08.409
network, the connections are all allowed only
from any layer to its succeeding layer, but
29:10.899 --> 29:17.899
the connections are not allowed from any layer
to its preceding layer. The example is you
29:18.460 --> 29:25.460
see here there are four layers. These are
all inputs. First hidden layer, second hidden
29:28.889 --> 29:35.889
layer, third hidden layer and this is output
layer. When we say the number of layers, we
29:38.269 --> 29:45.269
do not count the input layer as one of the
layers. When I say two layered network, then
29:45.590 --> 29:49.820
I have only one hidden layer and next layer
becomes output layer.
29:49.820 --> 29:56.820
This particular configuration means there
are sub-units, sub-neurons here and this particular
29:58.440 --> 30:05.440
configuration, if I connect it, you will see why
I say feed forward network, because I am only able
30:26.960 --> 30:33.960
to connect any layer from its preceding layer.
That means connections are allowed from the
30:35.100 --> 30:42.100
preceding layer to any layer, but I cannot
allow the feedback connection. This is called
30:42.620 --> 30:48.940
feedback connection; this is not allowed.
This is allowed. From this layer, I can connect
30:48.940 --> 30:55.940
to this layer. This is allowed, but I cannot
allow from this layer to connect to this layer.
31:00.710 --> 31:06.070
These are called feedback connections. They
are not allowed and that is why this is known
31:06.070 --> 31:13.070
as a feed forward network. Today, we will work with
a two-layered feed forward neural network
31:13.100 --> 31:18.919
with sigmoid activation function. We can very
easily see that this is 1 layer; this is the
31:18.919 --> 31:25.919
only hidden layer and this is the only output
layer; output layer is always only one.
31:28.690 --> 31:35.690
What will we do? We have a certain convention
that we will put while deriving a back propagation
31:37.899 --> 31:44.899
learning algorithm for this. The same simple
principle; given training data, we allow the
31:45.230 --> 31:52.110
input to pass through the network, compute
the error here, use the gradient descent rule
31:52.110 --> 31:58.740
and the back propagated error are used to
modify the weights here that is between output
31:58.740 --> 32:05.740
layer and hidden layer and again another form
of back propagated error here has to be used
32:10.740 --> 32:17.740
for modification of the weights between input
layer and hidden layer. This is again the
32:20.639 --> 32:23.720
convention that we will use.
32:23.720 --> 32:30.720
i is the index for a typical neuron in the
output layer. This you can see here, i is
32:31.480 --> 32:38.480
the index for neurons that are used in this
output layer, j is the index for neurons in
32:39.509 --> 32:46.509
the hidden layer and k is the index for the
input patterns and the weights for a typical
32:50.539 --> 32:57.539
weight, that is between the hidden layer and
the output layer is denoted or represented
32:58.519 --> 33:05.519
by wij. i is here and j refers to this particular
layer and here a typical weight between input
33:15.830 --> 33:22.830
layer and the hidden layer is represented by wjk,
where j refers to the index for the middle layer and
33:25.590 --> 33:32.590
k is the index for input layer. This is what
we have said here. i, j and k are three different
33:37.190 --> 33:44.190
indexes for three different layers; input,
hidden and output. This is the weight; typical
33:45.830 --> 33:51.590
weight representation between hidden layer
and output layer and this is the typical weight
33:51.590 --> 33:58.590
representation for the weights between the input
layer and the hidden layer. Now, let us see the
34:00.429 --> 34:03.630
derivation of the back propagation algorithm.
34:03.630 --> 34:10.630
First what do we do? As I said, in the network
you allow the input to pass through the network
34:12.649 --> 34:19.649
and compute the responses y1, y2, ..., yn. To compute
y1, y2, ..., yn, we note that the hidden units
34:22.149 --> 34:29.149
have outputs v1, v2, ..., vm, and these outputs
are computed first; the outputs of the hidden
34:33.179 --> 34:40.179
units are first computed and using the output
of the hidden units, the output of the output
34:41.159 --> 34:48.159
layer are computed. Just for clarity, again
I say, all these neurons are sigmoidally activated.
34:52.599 --> 34:59.599
They use the sigmoid activation function. We can
say that vj = 1 / (1 + e^(-
35:04.030 --> 35:11.030
hj)), where hj is the total input reaching the
jth neuron of the hidden layer. Similarly, the output
35:15.200 --> 35:20.990
of the ith neuron in the output layer is yi = 1
/ (1 + e^(-si)), where
35:20.990 --> 35:27.990
si = sigma_j wij vj; si is the
total input reaching the ith neuron in the
35:32.060 --> 35:39.060
output layer. Here we are computing the cost
function for the instantaneous update, which is
35:40.319 --> 35:45.650
E = (1/2) sigma_i (y_d,i -
yi)^2.
35:45.650 --> 35:52.650
So, the gradient descent rule is the same.
The weight update rule for weights between
35:53.150 --> 35:57.589
hidden layer and output layer will have
this particular form. That is the principle
35:57.589 --> 36:04.520
of gradient descent: wij(t+1) =
wij(t) - eta * (del E / del wij).
36:04.520 --> 36:08.240
The weight update rule for weights between the
input layer and the hidden layer is wjk(t
36:08.240 --> 36:15.240
+ 1) = wjk(t) - eta * (del E
/ del wjk). How do you now find del E by del
36:17.940 --> 36:24.940
Wij and del E by del Wjk? This is what we
have to find out. Here we showed this diagram.
36:28.849 --> 36:35.720
Again, I drew it on the blackboard for clarity,
because I would like to derive the whole thing
36:35.720 --> 36:42.720
in black board. You have input x1, x2 up to
xp and output y1 to yn, n different computational
36:46.940 --> 36:53.940
units, and you have a hidden layer with m hidden
neurons, whose outputs are v1, v2, ..., vm, and
36:56.770 --> 37:03.770
as I said, wij is the typical weight connecting
ith neuron in the output layer with jth neuron
37:04.869 --> 37:11.869
in the hidden layer. Similarly, wjk is the
typical weight between hidden layer and input
37:15.380 --> 37:21.839
layer. What will we do is in the first phase
we will allow this input to pass through the
37:21.839 --> 37:27.240
network and we compute what is y1. So, let
us compute that.
37:27.240 --> 37:34.240
What we are doing is we are computing the
output of a hidden neuron, vj. vj, the output
37:35.750 --> 37:42.750
of the jth neuron of the hidden layer, is vj = 1
/ (1 + e^(-hj)), where
37:43.109 --> 37:50.109
hj is the total input reaching the jth neuron:
hj = sigma_k wjk xk.
37:50.240 --> 37:57.240
If we go back, you see here vj. Say for example,
I want to compute v1. What do I do? I compute
37:59.660 --> 38:06.660
what h1 is: h1 = sigma_k w1k xk. That
is w11 x1 + w12 x2 and so on, plus
38:17.290 --> 38:24.290
w1p xp. This way we compute v1,
v2, ..., vm, and after we have computed v1, v2, ..., vm,
38:34.089 --> 38:41.089
we will now compute what is y1 up to yn?
38:43.160 --> 38:50.160
This computation is yi = 1 / (1 + e^(-si)),
where si = sigma_j wij
38:52.950 --> 38:59.950
vj. This you can also verify in the
previous slide. For example, I want to compute
39:03.880 --> 39:10.880
y1. First I will compute what s1 is, and you
can say s1 = w11 v1 + w12 v2 and so on,
39:22.319 --> 39:29.319
plus w1m vm. Once you compute s1, the
next step is y1 = 1 / (1 +
39:42.550 --> 39:49.550
e^(-s1)). This is a sigmoidal activation function.
We are done with how to compute y1 to yn given
39:54.770 --> 40:00.410
x1, x2 and xp using forward propagation.
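The forward propagation just described, hidden outputs first and then the output layer, can be sketched directly (the weight-list layout is an illustrative convention, not from the lecture):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, W_hid, W_out):
    """Two-layer forward pass.
    W_hid[j][k] is w_jk (input k -> hidden j); W_out[i][j] is w_ij (hidden j -> output i).
    Returns the hidden outputs v and the network outputs y."""
    v = [sigmoid(sum(wjk * xk for wjk, xk in zip(row, x))) for row in W_hid]
    y = [sigmoid(sum(wij * vj for wij, vj in zip(row, v))) for row in W_out]
    return v, y
```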
40:00.410 --> 40:07.410
We go to computing the error, error at the
output. I am given actually y1 d, y2 d, y3
40:14.670 --> 40:21.670
d and yn d. These are given. I compute
y1, y2, y3, ..., yn; these are computed. The
40:37.359 --> 40:44.359
network computes y1, y2, y3, ..., yn, and y1
d, y2 d, y3 d, ..., yn d are given. So, I compute the
40:46.300 --> 40:53.300
cost function. The instantaneous cost function
is E = (1/2) sigma_i (y_d,i - yi)^2.
41:00.780 --> 41:07.780
I subtract y from y desired, square it, and
sum over all the output units in the output layer.
41:12.900 --> 41:19.900
That is how I compute the cost function. This
is called instantaneous cost function
and this instantaneous function is computed
41:30.010 --> 41:37.010
for a specific pattern, for a given pattern
I compute this cost function.
41:38.200 --> 41:45.200
Now, I go to the next step. Once I compute
E, I have to compute the update
for a typical weight between the hidden
41:55.700 --> 42:02.700
layer and the output layer, wij. Whatever wij
was in the previous iteration
42:11.380 --> 42:18.380
has to be updated by adding a gradient
term.
42:20.130 --> 42:27.130
This is the weight update for weights between
hidden layer and output layer. All that you
42:46.880 --> 42:53.359
have to do is apply the simple derivative principle:
you are given y and you know
42:53.359 --> 43:00.359
the function structure, so you simply differentiate.
All that we have to do is compute del E upon
43:01.720 --> 43:06.390
del wij; if I can compute it, I just have to put
it there in this equation. So this is equation
43:06.390 --> 43:13.390
1. I differentiate del
by del wij, and here I have half sigma over
43:18.609 --> 43:25.609
i of (y_d,i - yi) whole square. You can see
that only the error contribution due to ith
43:44.180 --> 43:51.180
neuron is a function of wij.
Hopefully, you are very clear. If I am considering
43:57.569 --> 44:04.569
this particular unit, this is my typical weight,
which is w11, and I want to update w11. Then
44:06.369 --> 44:13.369
you should be very clear that only the error
computed here at y1 is a function of w11.
44:20.410 --> 44:27.410
You can very well see that yn is not a function
of w11. Once you are very clear about that
44:29.690 --> 44:36.690
I can write this one as del by del wij of half
(y_d,i - yi) whole square. The other terms are
44:48.190 --> 44:51.910
not relevant.
44:51.910 --> 44:58.910
We go again, we continue: del E by del wij
we found out to be del by del wij of half (y_d,
45:12.589 --> 45:19.589
i - yi) square. You can check it; this is (y_d,i
- yi) into, with a minus here, del yi by del wij.
45:32.260 --> 45:39.260
Hopefully, you are now clear with this expression.
This is, I can say, equation number 2. From
45:40.770 --> 45:47.770
equation number 2, we have to find out what
del yi by del wij is. What is yi? yi = 1
45:52.710 --> 45:59.710
/ (1 + e^(-si)). So,
del yi by del wij has to be written like this:
46:04.740 --> 46:11.740
del yi by del si into del si by del wij.
46:12.700 --> 46:19.700
We go to the next step. If yi = 1 / (1
+ e^(-si)), and we have already
46:23.420 --> 46:30.420
discussed this kind of activation function,
then if I differentiate with respect to si, I find the answer
46:38.660 --> 46:45.660
is yi (1 - yi). This we have already shown.
What is left is del si by del wij. To
46:52.030 --> 46:59.030
find del si by del wij, I must write down
del by del wij of si, where si = sigma_j wij vj.
47:10.480 --> 47:17.480
This is simply vj, because si is a summation
of terms; with respect to wij, only vj
47:22.369 --> 47:28.290
comes out. If we differentiate with
respect to wij, vj comes out.
47:28.290 --> 47:35.290
We have finally reached the solution: del E
by del wij = -(y_d,i - yi)
47:44.440 --> 47:51.440
into del yi by del wij, which was equation
number 2. That reduces to -(y_d,i -
47:58.559 --> 48:05.559
yi) yi (1 - yi) vj. This is the
final expression; call it equation 3. The yi (1 - yi) comes from
48:12.540 --> 48:19.540
del yi by del si and the vj comes from
del si by del wij. What was our earlier
48:28.069 --> 48:35.069
update equation? wij(t+1) = wij(
t) plus; this was actually minus, but I am making it
48:42.010 --> 48:49.010
plus because I am getting a minus here: minus eta del
E by del wij, and the term itself
48:51.900 --> 48:58.900
is minus, so this becomes plus eta (y_d,i -
yi) yi (1 - yi) vj. This is your
49:22.450 --> 49:29.450
Finally, I write down the weight update equation
for the weights between the hidden layer and
49:32.290 --> 49:39.290
output layer: the correction is eta deltai vj, where deltai
is y d i minus yi into yi 1 minus yi. This should not be
50:01.220 --> 50:08.220
surprising to you, because for a single layer
neural network with a non-linear activation
50:08.470 --> 50:15.470
function we also found delta to be of this form,
and the input is vj. I will not now derive
50:17.349 --> 50:24.349
the next update equation that is required
for the weights between hidden layer and input
50:24.799 --> 50:31.799
layer, which is wjk t plus 1; it has
a similar formula: wjk t plus eta
50:33.790 --> 50:40.790
deltaj xk. You can take a little pain to
work it out. It is a very nice
50:41.970 --> 50:48.970
formula where deltaj is vj 1 minus vj sigma
deltai wij over i. I give this as an exercise
51:08.240 --> 51:15.240
for you to derive this weight update equation.
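The exercise result can be sketched in code as well. This is my own hedged illustration (names hidden_deltas, update_hidden_weights, delta_out, W are mine), assuming sigmoid hidden units, of the hidden-layer delta deltaj = vj(1 - vj) sigma over i of deltai wij, and the corresponding update wjk(t+1) = wjk(t) + eta deltaj xk:

```python
def hidden_deltas(v, delta_out, W):
    """Back-propagated deltas for the hidden layer.

    v         : hidden-layer outputs vj
    delta_out : output-layer deltas deltai
    W         : W[i][j] is the weight wij from hidden unit j to output i
    Returns deltaj = vj * (1 - vj) * sum over i of deltai * wij.
    """
    deltas = []
    for j, vj in enumerate(v):
        back = sum(delta_out[i] * W[i][j] for i in range(len(delta_out)))
        deltas.append(vj * (1.0 - vj) * back)
    return deltas

def update_hidden_weights(Wjk, x, deltas, eta):
    """wjk(t+1) = wjk(t) + eta * deltaj * xk for the input-to-hidden weights."""
    return [[wjk + eta * deltas[j] * xk for wjk, xk in zip(row, x)]
            for j, row in enumerate(Wjk)]
```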
51:28.599 --> 51:33.609
We are done with the derivation of the back
propagation learning algorithm. I will just
51:33.609 --> 51:40.609
explain in the figure; we pass the information
from here to here. We have a general weight
51:51.130 --> 51:58.130
update algorithm for these weights here and
another general weight update algorithm for
51:58.700 --> 52:05.700
the layers here. So, that is the summary of
the weight update algorithm. What you learnt
52:06.339 --> 52:13.339
today is the gradient descent algorithm and its
derivation for a multilayered
52:15.640 --> 52:22.640
neural network. We will apply that derivation
for learning the XOR map, which you could not
52:24.599 --> 52:31.599
solve using a single layer feed-forward network
even with a non-linear activation function. Here,
52:33.930 --> 52:39.920
as usual this is our cost function and this
is our model of XOR network.
52:39.920 --> 52:46.520
We have two inputs x1 and x2. These are our
weights and we have put two bias units. This
52:46.520 --> 52:53.520
is the bias weight: there is a fixed input of 1,
and the weight is t1 here and t2 there.
52:56.950 --> 53:03.950
If you consider this
1 to be another external input,
53:06.069 --> 53:12.819
then you can also easily say this is w13
and this is w23. w13 has been
53:12.819 --> 53:19.819
represented by t1 here and w23 is represented
by t2 here. These weights, just for clarity,
53:20.220 --> 53:27.220
we have kept as w 1 0 and w 2 0, and the bias
weight is kept as t0. That is, the input is 1 and
53:30.089 --> 53:37.089
the weight is t0, and the output is y. Now,
you update the weight. This is t0, w 2 0 and
53:39.280 --> 53:46.280
w 1 0 using the weight update algorithm for
wij and t2, t1, w11, w12, w22, you update
53:51.420 --> 53:55.950
using the weight update algorithm for wjk.
53:55.950 --> 54:02.950
We do that. To start with, all these weights
you initialize between minus 0.1 and 0.1, and
54:10.990 --> 54:17.990
eta has been taken as 0.75 and let us take
the number of iterations as 50,000. What is an
54:21.170 --> 54:28.170
iteration? I take all these patterns sequentially.
I have four patterns and 50,000 means I have
54:34.109 --> 54:41.109
taken these four patterns again and again
until 50,000 iterations are over. If you do
54:44.359 --> 54:51.359
that, you see the weight vector that is obtained
between hidden layer and input layer is this
54:58.289 --> 54:58.680
one.
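The whole XOR experiment described above can be reproduced with a short script. This is a minimal sketch of my own, not the lecture's actual code: two sigmoid hidden neurons, eta = 0.75, weights initialized in [-0.1, 0.1], 50,000 passes over the four patterns. The exact converged weights differ from run to run depending on the random initialization:

```python
import math
import random

random.seed(0)

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Four XOR patterns: inputs (x1, x2) and desired output.
patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

eta = 0.75
rnd = lambda: random.uniform(-0.1, 0.1)
# Hidden layer: two neurons, each with weights [w_x1, w_x2, bias t].
Wh = [[rnd(), rnd(), rnd()] for _ in range(2)]
# Output neuron: weights [w10, w20, bias t0] on the hidden outputs.
Wo = [rnd(), rnd(), rnd()]

def forward(x):
    v = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in Wh]
    y = sigmoid(Wo[0] * v[0] + Wo[1] * v[1] + Wo[2])
    return v, y

sse0 = sum((y_d - forward(x)[1]) ** 2 for x, y_d in patterns)

for _ in range(50000):
    for x, y_d in patterns:
        v, y = forward(x)
        delta_o = (y_d - y) * y * (1.0 - y)
        # Hidden deltas use the pre-update output weights.
        delta_h = [v[j] * (1.0 - v[j]) * delta_o * Wo[j] for j in range(2)]
        for j in range(2):
            Wo[j] += eta * delta_o * v[j]
        Wo[2] += eta * delta_o            # bias input is 1
        for j in range(2):
            Wh[j][0] += eta * delta_h[j] * x[0]
            Wh[j][1] += eta * delta_h[j] * x[1]
            Wh[j][2] += eta * delta_h[j]  # bias input is 1

sse1 = sum((y_d - forward(x)[1]) ** 2 for x, y_d in patterns)
for x, y_d in patterns:
    print(x, y_d, round(forward(x)[1], 3))
```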
54:58.680 --> 55:03.589
This is the output layer, the weight between
the output layer and the hidden layer. These
55:03.589 --> 55:09.940
are the bias weights t1 and t2. Again, you
consider this is for hidden layer and input
55:09.940 --> 55:16.940
layer and this is for output layer, the bias.
After training if you give this input, this
55:24.329 --> 55:31.329
is x1 and this is x2. This is my desired output,
and the actual output is very close to the desired
55:32.500 --> 55:37.420
output. You can say this is almost exact.
55:37.420 --> 55:44.420
This is your error plot. You can easily see
that although we have taken 50,000 iterations,
55:46.880 --> 55:53.880
within 5,000 the training is over. It is not
actually necessary to go that far, and in a
55:57.410 --> 56:04.410
subsequent lecture we will explain why we should
not train further once training is over: because
56:04.530 --> 56:11.530
after 5,000 there is literally no more learning,
and this causes the problem of over-generalization,
56:13.289 --> 56:20.289
which will be discussed in the next class.
Now here is an assignment for you. The first
56:22.809 --> 56:29.809
one: please note down this linear dynamical
system. It is a second-order system with
56:31.069 --> 56:38.069
coefficients minus 0.75, minus 0.65, and
1.25, which correspond to w1, w2 and
56:42.200 --> 56:49.200
w3. Assume w1, w2, w3 to be very small random
numbers. Generate the training data using this model,
56:51.599 --> 56:58.599
where u is a random number uniformly generated
between 0 and 1, and apply this input data
57:04.740 --> 57:11.740
to train the linear neural
network. You should finally get the answer:
57:12.589 --> 57:19.589
w1 should be minus 0.75, w2 should be minus
0.65, and w3 should be 1.25. This answer you
57:23.390 --> 57:24.450
should get.
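A sketch of this assignment, under the assumption (following the earlier example in the lecture) that the model has the form y(k) = w1 y(k-1) + w2 y(k-2) + w3 u(k-2), might look like this. All names here are my own, and this is one possible solution, not the official one; it uses the instantaneous (LMS) update law from the previous lecture:

```python
import random

random.seed(1)

# Assumed system form: y(k) = w1*y(k-1) + w2*y(k-2) + w3*u(k-2)
w_true = [-0.75, -0.65, 1.25]

# Generate training data with u uniformly random in [0, 1].
N = 2000
u = [random.random() for _ in range(N)]
y = [0.0, 0.0]
for k in range(2, N):
    y.append(w_true[0] * y[k-1] + w_true[1] * y[k-2] + w_true[2] * u[k-2])

# Train a single linear neuron with the instantaneous update law
# w <- w + eta * e * x, starting from very small random weights.
w = [random.uniform(-0.01, 0.01) for _ in range(3)]
eta = 0.05
for epoch in range(50):
    for k in range(2, N):
        x = [y[k-1], y[k-2], u[k-2]]
        e = y[k] - sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * e * xi for wi, xi in zip(w, x)]

print([round(wi, 3) for wi in w])
```

Because the data is noise-free and the input is persistently exciting, the weights converge to the true coefficients.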
57:24.450 --> 57:31.450
The second one may be a little difficult for
you; here you have to use a multilayered
57:33.789 --> 57:40.789
feed-forward neural network. I suggest you
use around 10 hidden neurons. You have one
57:41.890 --> 57:48.890
output; the output layer has only 1 neuron,
and the input is only u k, so take only 1 neuron
57:53.329 --> 58:00.329
in the input layer. That is, the input layer has
only one input. You can include one bias input.
58:04.299 --> 58:11.299
You generate the data for this using the
actual model, taking u k again as a randomly generated
58:14.930 --> 58:21.930
number from 0 to 1, and then normalize the
values y k and u k between 0 and 1, because
58:27.490 --> 58:33.770
your neural network has an activation function
whose output is between 0 and 1.
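The min-max normalization required here can be done with a small helper. This is my own illustrative code (the names normalize and denormalize are not from the lecture):

```python
def normalize(seq):
    """Scale a sequence linearly into [0, 1] (min-max normalization)."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0 for _ in seq]   # constant sequence: map everything to 0
    return [(x - lo) / (hi - lo) for x in seq]

def denormalize(z, lo, hi):
    """Map a network output in [0, 1] back to the original range."""
    return lo + z * (hi - lo)
```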
58:33.770 --> 58:40.770
You have to normalize the output also between
0 and 1, and after that you train. After training,
58:42.440 --> 58:49.440
unlike in the case of the linear
dynamical system, you cannot correlate the weights with the coefficients, because
58:49.510 --> 58:56.109
your neural network becomes a black box; we
will discuss this more. What you can verify
58:56.109 --> 59:03.109
is this: you give new input data to the actual model,
and for that new input data
59:06.520 --> 59:13.410
find out what the output is. Giving this new
input data to your trained neural network,
59:13.410 --> 59:20.410
you should be able to get close to the desired
output. Good bye. We will meet again in the next
59:27.270 --> 59:30.930
class.