WEBVTT
00:15.179 --> 00:22.179
This
is the third lecture of module 1 – Neural
Networks. We have already discussed the linear
00:30.970 --> 00:37.970
neural network and system identification in
linear neural network; then, we talked about
00:37.980 --> 00:44.980
the feed forward neural network, that is,
the multilayer network, which has the capability
00:45.190 --> 00:52.190
of approximating any nonlinear function. We
learnt in the second lecture how to derive
00:54.510 --> 01:00.850
the back propagation algorithm for the feed
forward neural network that can approximate
01:00.850 --> 01:03.100
any nonlinear function.
01:03.100 --> 01:10.100
Today, in this third lecture, we will again
review the back propagation algorithm that
01:11.420 --> 01:18.420
we derived in the last lecture; then, we generalize
the delta rule and we will understand this
01:18.649 --> 01:25.649
concept today. Then, system identification
using this back propagation algorithm; then
01:27.930 --> 01:34.030
two different variations in back propagation
algorithm, that is, adding a momentum and
01:34.030 --> 01:40.590
adaptive learning rate. Today, we will just
have a very heuristic version of the adaptive
01:40.590 --> 01:47.590
learning rate. Probably after three or four
lectures, you will have a detailed analysis
01:50.780 --> 01:56.100
on how to comprehensively design adaptive
learning rate for back propagation network,
01:56.100 --> 02:03.100
the reason being that this particular analysis
would need the concepts of stability for nonlinear
02:06.299 --> 02:13.299
systems, that is, Lyapunov function and Lyapunov
function-based stability theory. These notions
02:14.540 --> 02:21.540
have to be reviewed, before we can talk about
a very comprehensive method of computing adaptive
02:23.559 --> 02:29.599
learning rate for back propagation network.
02:29.599 --> 02:33.829
We have already discussed the multilayer
feed forward network. We said that you have
02:33.829 --> 02:40.829
an input layer and then you have many layers.
These layers have many neurons. These are
02:52.299 --> 02:59.299
all input signals coming here and they are
fanned out. This is your output layer and
03:05.519 --> 03:12.519
these all are hidden layers. This is called
the multilayer feed forward network. It has
03:16.510 --> 03:23.040
one or more hidden layers of computational units,
that is, there can be more than one hidden layer.
03:23.040 --> 03:30.040
You can have one, two, three, four – as
many as we want. FNN stands for feed forward
03:30.299 --> 03:37.299
neural network. In this feed forward neural
network, the connections are allowed from
03:40.290 --> 03:47.290
one layer to the succeeding layer in the forward
direction and not in the backward direction.
03:47.469 --> 03:54.469
I cannot have a connection like this. This
is not allowed, this connection is not allowed
03:54.659 --> 03:58.329
– from this layer to this layer, but it
is allowed from this layer to this layer.
03:58.329 --> 04:05.329
The hidden layer gets input from input layer
and gives output to the next hidden layer
04:05.459 --> 04:12.459
or output layer after internal computation.
This is the basic structure of feed forward
04:13.019 --> 04:15.619
network.
04:15.619 --> 04:21.489
This is the notation we have used in the last
class about the back propagation network.
04:21.489 --> 04:28.489
x is the input, v is the hidden node output,
y is the actual output, wij is the weight
04:31.680 --> 04:37.699
connection between i th unit of output layer
and j th unit of hidden layer, wjk is weight
04:37.699 --> 04:41.620
connecting j th unit of hidden layer and k
th unit of input layer.
04:41.620 --> 04:48.620
You can easily see that x is the input, which
is a p-dimensional vector. The summation of
04:52.409 --> 04:59.409
weights into the input signals are all summed
here and that sum is h. If I say hj, it is
05:05.039 --> 05:12.039
simply sigma over k wjk xk – this is your
hj. After you go through this sigmoidal activation,
05:22.090 --> 05:29.090
that becomes your v, so vj is simply 1 upon
1 plus e to the power of minus hj – this
05:30.110 --> 05:37.110
is after sigmoidal activation. Again as usual,
output y is an n-dimensional output. yi is….
05:47.389 --> 05:51.659
You sum over all the vj, weighted by wij. We have shown this here.
05:51.659 --> 05:58.659
vj is 1 upon 1 plus e to the power of minus
hj, where hj is wjk xk and k equal to 1 to
06:01.400 --> 06:08.199
p. Similarly, final response yi is phi gi,
which is 1 upon 1 plus e to the power of minus
06:08.199 --> 06:15.199
gi, where gi is wij vj. If I go back here,
I write here g and that means gi is simply
06:21.300 --> 06:28.300
sigma wij into vj. This is sigma. What you
are seeing is that wij is the connection weights
06:36.379 --> 06:43.259
between the output layer and the hidden layer
and wjk is the typical weight between the
06:43.259 --> 06:50.259
hidden layer and the input layer. This is
the notation we used in the last class. This
06:51.199 --> 06:55.749
is called forward phase. In the forward phase,
we computed the output of the hidden unit
06:55.749 --> 06:59.520
and output of the output unit.
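The forward phase just described can be sketched in code. A minimal NumPy sketch, assuming a single hidden layer with sigmoidal activations and using the lecture's names (wjk for input-to-hidden weights, wij for hidden-to-output weights):

```python
import numpy as np

def sigmoid(z):
    # sigmoidal activation: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_jk, W_ij):
    # hidden layer: hj = sum_k wjk xk, then vj = sigmoid(hj)
    h = W_jk @ x
    v = sigmoid(h)
    # output layer: gi = sum_j wij vj, then yi = sigmoid(gi)
    g = W_ij @ v
    y = sigmoid(g)
    return v, y
```

The hidden output v is returned along with y because the backward phase will need it.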
06:59.520 --> 07:06.520
Then, back propagation. Given the input output
patterns, given input x, I know what the desired
07:10.499 --> 07:17.499
target – y d is. The network has a response
yi and so, I compute a cost function E of
07:18.710 --> 07:25.550
t, which is a quadratic cost function. I use
the gradient descent rule, which is of the
07:25.550 --> 07:32.060
nature wij of t plus 1 is wij of t minus
eta into dow E of t upon dow wij of t. We
07:32.060 --> 07:37.469
have already discussed a lot about gradient
descent; we apply this gradient descent to
07:37.469 --> 07:40.110
compute the error back propagation.
07:40.110 --> 07:47.110
During that, we update the weights connecting
the output layer and hidden layer. The weight
07:47.430 --> 07:52.849
that we are updating is wij. You can
see that the weight wij that we are talking
07:52.849 --> 07:59.849
about is this weight. Once I transfer the
signal from x to y, then my target is there.
08:02.139 --> 08:09.139
I compute the target error here and I back
propagate the target error. This is my e here
08:09.650 --> 08:15.680
at the target and this target error is back
propagated. Through back propagation, I update
08:15.680 --> 08:22.680
what is wij – it has to be updated.
08:24.999 --> 08:31.669
To compute what is the weight update in wij,
we need to compute what is dow E of t upon
08:31.669 --> 08:38.669
dow wij and that gives me the formula dow
Ei of t upon dow yi into dow yi upon dow wij
08:51.260 --> 08:58.260
of t. How do I compute dow E upon dow wij?
I know that E is summation of Ei. Hence, differentiating
09:02.639 --> 09:09.639
E with respect to wij means I differentiate
Ei with respect to wij and take a sum. Then,
09:13.600 --> 09:20.600
individual Ei is a function of yi. So, I differentiate
Ei with respect to yi and then, I differentiate
09:23.089 --> 09:30.089
yi, which is a function of wij. That is how
I compute this partial derivative.
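The chain rule step relies on the fact that the derivative of the sigmoid yi = 1 upon 1 plus e to the power of minus gi with respect to gi is yi into 1 minus yi. A quick finite-difference check of that identity:

```python
import math

def sigmoid(g):
    return 1.0 / (1.0 + math.exp(-g))

g = 0.7
eps = 1e-6
# numerical estimate of dy/dg at g
numeric = (sigmoid(g + eps) - sigmoid(g - eps)) / (2.0 * eps)
# closed form used in the derivation: y * (1 - y)
y = sigmoid(g)
analytic = y * (1.0 - y)
assert abs(numeric - analytic) < 1e-8
```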
09:30.820 --> 09:37.820
Here, this is dow E upon dow yi of t, which
is half, because, this function Ei is simply
09:42.040 --> 09:49.040
half into yi d minus yi whole square – this is Ei.
Hence, if you look at this here, it is half
09:51.010 --> 09:58.010
into 2 yi d minus yi into… when you differentiate
with respect to yi, we get minus 1 here. That
10:03.750 --> 10:10.750
is very clear. Now, dow yi upon dow wij is
dow yi upon dow gi because, we found out that
10:12.360 --> 10:19.360
yi is 1 upon 1 plus e to the power of minus
gi. We have already shown that here: yi is
10:20.050 --> 10:27.050
a function of gi. While differentiating, I
will obviously differentiate yi with respect
10:28.959 --> 10:33.220
to gi and then gi with wij.
10:33.220 --> 10:40.220
Doing that way, I finally get this expression
delta wij is eta into yi d minus yi into yi
10:45.009 --> 10:52.009
into 1 minus yi into vj. We wrote the weight
update algorithm in a generalized format as
10:55.000 --> 11:02.000
wij of t plus 1 is wij of t plus eta deltai
vj and vj is the input. You can see easily
11:05.250 --> 11:12.250
here that when I am updating weights for wij,
vj is my input to the weights in this layer.
11:14.630 --> 11:21.630
vj is the input to the weights in this layer.
My error is e, which is… The back propagated
11:25.170 --> 11:32.170
error that I am talking of here is represented
as deltai. This deltai, which is the error
11:33.910 --> 11:40.910
back propagated from the output layer, is
defined as yi into 1 minus yi into the error
11:43.290 --> 11:50.290
at the output, which is yi d minus yi. This
is called the back propagation algorithm. My
11:50.329 --> 11:56.560
weights are updated and the new weight is
old weight plus eta, error back propagated
11:56.560 --> 12:03.560
into the input signal to
the connection.
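As a sketch, the output-layer update just stated (new weight is old weight plus eta into the back propagated error deltai into the input vj) might look like this in NumPy; the function name and eta default are illustrative:

```python
import numpy as np

def output_layer_update(W_ij, v, y, y_d, eta=0.2):
    # deltai = yi * (1 - yi) * (yi_d - yi): error back propagated at the output
    delta_i = y * (1.0 - y) * (y_d - y)
    # wij(t+1) = wij(t) + eta * deltai * vj (vj is the input to the connection)
    return W_ij + eta * np.outer(delta_i, v)
```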
12:05.009 --> 12:12.009
Similarly, we derived for wij. Now, we are
talking about wjk. wjk was the weight vector
12:14.959 --> 12:21.959
or weight matrix between the hidden layer
and input layer. We have already derived it
12:26.819 --> 12:31.819
and I will not go into a detailed discussion.
What we did is that when I differentiate E
12:31.819 --> 12:38.819
with respect to wjk, then I differentiate Ei
with respect to yi and yi with respect to
12:40.490 --> 12:43.620
wjk in a similar manner.
12:43.620 --> 12:50.620
Ultimately, we finally get a relationship,
which is again a generalized form: wjk of
12:52.880 --> 12:59.880
t plus 1 is wjk of t plus eta deltaj xk. This
has a similar format, you see; the earlier one was
13:01.470 --> 13:08.470
wij of t plus 1 was wij of t plus eta deltai
vj. This is the weight update between output
13:17.800 --> 13:24.800
layer and hidden layer and this is the hidden
layer and input layer, where the input is
13:24.990 --> 13:31.990
xk in the input layer and deltaj is the error
back propagated from the output layer. You
13:34.060 --> 13:41.060
see that deltai is the error back propagated
in the output layer and we multiply the corresponding
13:43.199 --> 13:49.690
weights wij and do the summation. The quantity
deltaj is the total error back propagated
13:49.690 --> 13:56.690
from the output layer to the layer between
input layer and hidden layer into vj into
13:57.769 --> 14:04.769
1 minus vj. So, deltaj is the back propagated
error to the layer that is situated between
14:05.290 --> 14:12.290
the hidden layer and input layer. Now, I will
summarize what we saw in the back propagation
14:14.360 --> 14:20.339
algorithm.
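The hidden-layer update has the same delta-rule shape, with deltaj built from the output-layer deltas. A minimal sketch under the same assumptions as before:

```python
import numpy as np

def hidden_layer_update(W_jk, W_ij, x, v, delta_i, eta=0.2):
    # deltaj = vj * (1 - vj) * sum_i deltai * wij: error back propagated
    # from the output layer through the weights wij
    delta_j = v * (1.0 - v) * (W_ij.T @ delta_i)
    # wjk(t+1) = wjk(t) + eta * deltaj * xk (xk is the input to the connection)
    return W_jk + eta * np.outer(delta_j, x)
```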
14:20.339 --> 14:27.339
We have a typical input layer – vector x,
which is p by 1 and then a single hidden layer
14:31.680 --> 14:38.680
whose output is v, which is m by 1. This is
p. Then, there is an output layer whose output
14:43.230 --> 14:50.230
is y and this is n by 1. Let us consider a
j th computational unit in the hidden layer,
14:56.940 --> 15:03.940
i th computational unit in the output layer,
and k th computational unit in the input layer.
15:09.110 --> 15:16.110
What we saw in the back propagation algorithm
is wij of t plus 1 is wij of t plus eta deltai
15:29.589 --> 15:36.589
vj. What is vj? What is wij? This is my connection
– this is wij. This particular weight is
15:43.449 --> 15:50.449
updated based on the error on a single data
set, based on the instantaneous update rule.
15:55.110 --> 16:02.110
What we do is we update wij based on its previous
value. eta is the
16:05.689 --> 16:12.689
learning
rate and normally, this value eta is 0 to
1. deltai is the error back propagated from
16:28.339 --> 16:33.769
the output to this output layer. I normally
say that this is the second layer and this
16:33.769 --> 16:40.269
is the first layer. Let me say that this is
the second layer.
16:40.269 --> 16:47.269
We found deltai to be yi into
1 minus yi into the error that is yi d minus
16:55.850 --> 17:02.850
yi. This is my error at the output. This, multiplied
by yi into 1 minus yi, gives deltai. vj is the input
17:09.060 --> 17:16.060
– this is vj. If you look at wij, the input
is vj to this connection and the output of
17:19.780 --> 17:26.780
this unit is yi. Given that, the deltai has
a unique structure. deltai is yi into 1 minus
17:28.530 --> 17:35.530
yi into yi d minus yi and the update rule
also has a unique structure because, vj is
17:36.600 --> 17:43.600
the input to the connection, delta is the
error being back propagated through this weight
17:45.640 --> 17:52.640
and this is called the delta rule.
17:52.750 --> 17:59.750
The weight update rule has a very simple form,
which is delta, which is error back propagated,
18:02.450 --> 18:09.450
into vj. We will see the same thing here.
This weight is wjk by convention – we have already
18:10.710 --> 18:17.710
discussed that – and this is my xk, the output
of this unit. The output of this unit is vj.
18:18.490 --> 18:25.490
We saw that the weight update law is wjk of
t plus 1 is wjk of t plus eta deltaj into
18:36.710 --> 18:43.710
xk. You see that this and this have the same
form and that is why this is called delta
18:46.210 --> 18:53.210
rule. Of course, the deltaj has a different
value than deltai, but it also has a unique
18:55.660 --> 19:02.660
structure. deltaj is vj into 1 minus vj into
sigma over i deltai wij.
19:06.240 --> 19:13.240
I sum over i from 1 to n the terms deltai wij, and
multiply by vj and 1 minus vj. This is my error being
19:19.450 --> 19:26.450
propagated from this unit. This
is the error deltaj. So this is deltaj, the error
19:28.130 --> 19:35.130
being back propagated through this particular
connection. It is very important to understand
19:35.870 --> 19:41.720
this. deltaj into xk, where xk is the input
to this connection. You can easily see that
19:41.720 --> 19:48.500
the error back propagated in that particular
connection weight into the input to that connection
19:48.500 --> 19:55.500
weight, if you multiply by the learning rate,
that gives you the back propagation rule.
19:57.190 --> 20:03.990
It is a very simple rule. Using this delta
rule concept, we can write down this back
20:03.990 --> 20:10.990
propagation algorithm for any layer in the
network. That is what we will write now.
20:12.200 --> 20:19.200
Generalized delta rule: We derived this delta
rule for a single hidden layer. Now, we will
20:19.310 --> 20:26.310
generalize this delta rule for any number
of hidden layers in a feed forward network.
20:27.580 --> 20:34.580
That is why it is called a generalized delta
rule. You see here we have a multilayered
20:37.410 --> 20:43.010
neural network and this multilayered neural
network has L layers. This is the input layer,
20:43.010 --> 20:50.010
first layer, second layer, third layer and
L th layer. These are all feed forward networks,
20:50.800 --> 20:57.800
these are all feed forward networks and these
are all feed forward networks.
20:58.130 --> 21:02.400
Like we earlier assumed, the indexes for each
layer are i, j and k; i is for output layer,
21:02.400 --> 21:08.510
j is for hidden layer and k is for input layer.
Similarly, here, the index we have generalized
21:08.510 --> 21:15.510
iL for the L th layer and similarly, il is for
the l th layer, i3 is for the third layer, i2
21:15.900 --> 21:22.900
is for the second layer and obviously, for
the input layer, the index is i0. With this
21:24.470 --> 21:30.590
particular convention, we can now easily write
down using the same – whatever we discussed
21:30.590 --> 21:37.590
here, the normal delta rule. This is called
delta rule because, our weight update algorithm
21:40.190 --> 21:45.590
consists of the error back propagated multiplied by
the input to the specific connection weight. So,
21:45.590 --> 21:52.590
this is the delta rule. Using that, we can
now write down the update rule.
21:56.180 --> 22:03.180
You see here, this is
the typical connection weight between l th
and l minus 1 th layer. What I am talking
22:15.300 --> 22:22.300
about here is this is my wil, il minus 1.
These are the weights between the L th layer
22:29.980 --> 22:36.980
and the preceding layer. We are talking about
how to update the weight in this layer. Once
22:40.710 --> 22:47.710
I pass the input through this network, compute the
output at the output layer, compare with the
22:48.710 --> 22:55.710
desired data, then I want to update this weight.
It is the same formula, that is, eta deltail
23:00.380 --> 23:07.380
into vi l minus 1. This is the input to this
connection weight and this is the error being
23:09.090 --> 23:16.070
back propagated. Obviously, we can easily
see that deltail lowercase l and uppercase
23:16.070 --> 23:23.070
L are the same – we are not discriminating.
deltail is yil d minus yil into yil into 1 minus yil. This first factor is
23:30.020 --> 23:37.020
our error at the output and you have to multiply
it by the term yil into 1 minus yil. That transforms
23:41.460 --> 23:46.110
the error at the output into the back propagated
error.
23:46.110 --> 23:53.110
This deltail is here. This is the error back
propagated in this immediate layer; that is
24:03.030 --> 24:10.030
the L th layer. This is for the output layer.
This l represents any layer and deltail is
24:23.520 --> 24:30.520
the delta for capital L, that is, the output
layer. We have made some changes here. Let
24:34.300 --> 24:41.300
Wil, i l minus 1 denote the synaptic weight
connecting the i th neuron of layer l to that
24:43.330 --> 24:50.330
of layer l minus 1. This l is any typical
layer. Considering sigmoidal function as the
24:53.150 --> 24:58.420
activation function for each layer, the weight
update law can be written as….
24:58.420 --> 25:05.420
This is for any layer. This particular weight
is the weight between any of these layers.
25:10.170 --> 25:17.170
For example, if I take L is equal to 2, then
this is the layer; if L is equal to 3, then
25:18.340 --> 25:25.340
this is the layer; if L is equal to 1, then
this is the layer; if l is equal to capital
25:26.110 --> 25:33.110
L, then this is the layer. This is the generalized
delta rule. Given the delta at the output
25:51.320 --> 25:58.320
layer, we can easily compute corresponding
delta in all preceding layers subsequently.
26:00.050 --> 26:06.790
Then, we can update all the weights in all
the layers using this concept of delta rule.
26:06.790 --> 26:13.790
I will give you a simple example. Let us explain
this generalized delta rule through another
26:17.790 --> 26:19.800
example – we will talk about a four-layer
network.
26:19.800 --> 26:22.390
This is the input layer; this is the first
layer, second layer, third layer and fourth
26:22.390 --> 26:27.710
layer. Obviously, the weight between the third
layer and fourth layer will be represented
26:27.710 --> 26:34.710
by Wi4, i3; the weight between the third layer
and second layer is Wi3, i2; the weight between
26:35.100 --> 26:41.960
the first layer and second layer is Wi2, i1,
and the weight between input layer and first
26:41.960 --> 26:48.960
layer is Wi1, i0. This is the convention that
we are following. Given the error, that is,
26:54.860 --> 27:01.860
y d minus y at the output layer, how do we
go?
27:04.450 --> 27:11.450
Naturally, the element of y is yi4 because,
i4 is the index for the output layer. When
27:15.220 --> 27:22.220
I say y is a vector, then yi4 is an element
of the vector y. So, y is the output vector.
27:25.240 --> 27:30.590
This is the weight update law for the output
layer, that is, the fourth layer. This is
27:30.590 --> 27:36.110
the first layer and then second layer. When
I talk in terms of connection weight, this
27:36.110 --> 27:40.480
is the first layer of connection weight, second
layer of connection weight, third layer of
27:40.480 --> 27:46.720
connection weight and fourth layer of connection
weight. As many layers as we have, we have that
27:46.720 --> 27:50.160
many layers of connection weights.
27:50.160 --> 27:57.150
What you are seeing here, as we saw earlier,
is that Wi4, i3, the typical weight in the
27:57.150 --> 28:03.330
fourth layer, is updated using the generalized
delta rule, which is eta into delta, which
28:03.330 --> 28:10.330
is the error back propagated, and vi3, which
is the input to the connection weight Wi4,
28:13.890 --> 28:20.890
i3. What is deltai4? deltai4 is obviously
yi4 d minus yi4 into yi4 into 1 minus yi4.
28:25.990 --> 28:32.540
This is my error that I computed here. This
is transformed and we get this because of
28:32.540 --> 28:38.340
the sigmoidal activation function. If we have
a linear activation function, this particular
28:38.340 --> 28:45.340
term will not be there. This is my update
rule for the last layer or the output of fourth
28:47.260 --> 28:54.260
layer. Similarly, for the third layer, it
is eta deltai3 vi2.
29:01.600 --> 29:08.600
This is my deltai4 here; this is delta i3,
deltai2 and deltai1. I computed what is deltai4.
29:14.410 --> 29:21.410
Based on deltai4, I can compute what deltai3
is. This is what I will do. deltai3 is vi3
29:22.010 --> 29:29.010
into 1 minus vi3 into delta i4 Wi4, i3. This
is actually i4 equal to 1 to n or the number
29:38.730 --> 29:45.730
of units in my L th layer. Let me say that
this is n4 and similarly, this is n3. The
29:57.440 --> 30:04.440
weight update law for the second layer will
be eta deltai2 vi1. We can easily check it
30:08.380 --> 30:15.380
again. For this, it is deltai4 into vi3; this
is delta i3 vi2; this is deltai2 vi1 and this
30:19.890 --> 30:26.890
is deltai1 into xi0 because it is the input
layer here – xi0. This is what we will have
30:30.730 --> 30:37.730
now and this is what you can easily see. eta
deltai3 vi2 is for third layer, deltai2 vi1
30:39.220 --> 30:46.220
is for second layer, and this is your error
back propagated. deltai2 is computed based
30:49.060 --> 30:52.820
on deltai3 and deltai3 is computed based on
deltai4.
30:52.820 --> 30:59.820
Similarly, finally, deltai1 is computed based
on deltai2. This is your first layer. Given
31:05.160 --> 31:11.630
any layer, using generalized delta rule, we
can write the back propagation algorithm.
31:11.630 --> 31:18.480
We do not have to derive again making all
those complicated forward response equations
31:18.480 --> 31:24.160
and then differentiating the error with respect
to each typical weight and finding all those
31:24.160 --> 31:29.770
– there is no need actually. This rule is
actually generalized and we do not have to
31:29.770 --> 31:36.770
compute them from scratch; we can directly write
down the right formula.
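The whole generalized delta rule reduces to one backward loop: compute the delta at the output layer, then peel off one layer at a time, updating weights and back propagating the delta. A minimal sketch assuming sigmoidal activations in every layer and a list W of weight matrices ordered from input to output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W, x, y_d, eta=0.2):
    # forward phase: v[l] is the output of layer l, with v[0] = x
    v = [x]
    for W_l in W:
        v.append(sigmoid(W_l @ v[-1]))
    # delta at the output layer: y * (1 - y) * (y_d - y)
    y = v[-1]
    delta = y * (1.0 - y) * (y_d - y)
    new_W = [None] * len(W)
    for l in range(len(W) - 1, -1, -1):
        # delta rule: w(t+1) = w(t) + eta * delta * (input to the connection)
        new_W[l] = W[l] + eta * np.outer(delta, v[l])
        # delta for the layer below, computed with the old weights
        delta = v[l] * (1.0 - v[l]) * (W[l].T @ delta)
    return new_W
```

The same loop body serves every layer, which is exactly the point of the generalized delta rule.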
31:39.800 --> 31:46.800
We are done with the first part that we wanted
to explain today – the generalized delta
31:48.630 --> 31:55.630
rule. In the generalized delta rule, the concept
was that we wanted a recursive formula for
32:00.400 --> 32:03.350
a back propagation learning algorithm.
32:03.350 --> 32:08.010
It allows the error signal of a lower layer
to be computed as a linear combination of
32:08.010 --> 32:14.270
the error signal of the upper layer. In this
manner, the error signals are back propagated
32:14.270 --> 32:21.270
through all the layers from the top down,
that is, from the last layer – the output layer –
32:21.700 --> 32:28.700
backward until we reach the input layer. We
can compute the error back propagation, which
32:29.460 --> 32:36.460
is in terms of delta, which is here.
32:37.870 --> 32:44.870
We have to compute here what is delta iL for
the output layer and based on that, we finally
32:45.080 --> 32:52.080
compute what is delta i1. deltai1 is computed
based on deltai2, deltai2 is computed based
32:53.540 --> 33:00.270
on deltai3, deltai3 is computed based on deltai4
and so on. Finally, delta iL minus 1 is computed
33:00.270 --> 33:07.270
based on what is delta iL. The general form is
shown here. This is
33:12.780 --> 33:19.160
a typical weight between the l th layer and
l minus 1 th layer.
33:19.160 --> 33:26.160
The update rule or delta w is eta deltail
into vi l minus 1. What you are seeing is
33:27.610 --> 33:34.610
eta deltail vil minus 1. This is the input
to the connection and this is the error back
33:34.840 --> 33:41.840
propagated to the layer l. Using this concept,
we will now demonstrate the application of
33:46.130 --> 33:53.130
the back propagation algorithm for system
identification. Since it is a control course,
33:53.280 --> 34:00.280
you would always use system identification
while demonstrating any application of a specific
34:07.060 --> 34:14.060
neural network. A practical system, the surge
tank system, has been taken for simulation.
34:17.460 --> 34:24.460
We will identify the surge tank system using
a neural network or we will model the surge
34:24.720 --> 34:25.619
tank. So, what is the surge tank?
34:25.619 --> 34:32.619
A surge tank – this is what you are looking at.
Normally, a surge tank is used to control
34:34.750 --> 34:41.750
the hydraulic transients and pressure changes.
You see this is a big reservoir. In this reservoir,
34:42.509 --> 34:49.509
the water level sometimes increases suddenly
or suddenly decreases, but this reservoir
34:49.980 --> 34:56.980
is connected particularly to a hydro power
plant, where the turbines have to move in
34:57.180 --> 35:04.180
a specific speed. So, the flow rate here has
to be constant. If the water level suddenly
35:06.660 --> 35:13.400
increases or suddenly decreases in this reservoir,
then this flow rate will be affected. To maintain
35:13.400 --> 35:18.009
the flow rate as constant, the surge tank
is used.
35:18.009 --> 35:25.009
What happens? Whatever the disturbance in
the reservoir level, it can be controlled
35:26.420 --> 35:33.420
using this surge tank. The extra pressure
you have, the extra flow can be pushed to
35:37.839 --> 35:44.839
the surge tank. Normally, the surge tank has
a cylindrical structure. For such a surge tank,
35:53.039 --> 36:00.039
the structure is like this. This is a cylindrical
structure. That means everywhere, the diameter
36:05.039 --> 36:11.630
of this surge tank… for example, if this
is circular, then the diameter is constant.
36:11.630 --> 36:18.630
But if the surge tank has a nonlinear structure,
then given the specific flow into this surge
36:24.670 --> 36:31.670
tank, the water level in the surge tank and the
flow rate into the surge tank will bear
36:34.900 --> 36:37.390
a nonlinear relationship.
We will now try to model this surge tank,
36:37.390 --> 36:44.390
because, unless we know how to model the surge
tank, we cannot design it; we must know
36:47.750 --> 36:49.589
how the water level in this surge tank is
increasing or decreasing, given a specific
36:49.589 --> 36:56.589
water flow into this surge tank. Based on
that, we will design a surge tank and a specific
37:00.140 --> 37:07.140
dimension of the surge tank, that is, how
the diameter will vary – linearly or nonlinearly,
37:07.529 --> 37:14.119
etc. We are just taking a nonlinear surge
tank model. This is a schematic diagram of
37:14.119 --> 37:20.099
a surge tank. When the pressure increases
due to sudden change in the flow from the
37:20.099 --> 37:25.190
reservoir, the level of the surge tank increases;
thus, controlling the flow as well as pressure
37:25.190 --> 37:31.730
to the connecting system; this flow is controlled
accordingly.
37:31.730 --> 37:38.730
Typically, we are not interested now in how
we derive the model of a surge tank. What
37:40.690 --> 37:47.690
we are interested in is given a nonlinear
structure-based surge tank, this is the dynamics
37:52.200 --> 37:58.099
of the surge tank, that is, h of t plus 1
is equal to h of t plus T, which is the sampling
37:58.099 --> 38:05.099
time, and this particular nonlinear … you
can see easily that this is a nonlinear function,
38:06.500 --> 38:13.500
because, this is square root of h and also,
u is multiplied with 1 upon square root of
38:16.390 --> 38:19.920
h and hence, this is a nonlinear function.
38:19.920 --> 38:26.920
t is the discrete time step, that is, the
sample instant per second, T is the sampling
38:28.339 --> 38:34.829
time, u of t is input flow, h of t is the
liquid level in the surge tank and g is the
38:34.829 --> 38:41.829
acceleration due to gravity. This is our model
of the surge tank. Why I am considering this
38:42.440 --> 38:48.259
model is because, in simulation, we do not
collect data from the actual surge tank – we
38:48.259 --> 38:55.259
collect data from a mathematical model in
simulation. Using that data, we will represent
39:00.240 --> 39:06.220
those data in terms of neural network model.
This neural network is the back propagation
39:06.220 --> 39:08.279
network that we have just learned.
39:08.279 --> 39:15.279
What we have done is that we have taken this
model and we generated data using this model,
39:19.730 --> 39:26.730
not using an actual surge tank – an actual
surge tank is there in a field, probably in
39:27.240 --> 39:34.240
a hydro power plant. We want to demonstrate
the application of a back propagation algorithm
39:34.420 --> 39:41.420
to a specific system modeling. This is the
mathematical model of the surge tank and using
39:42.630 --> 39:48.930
this mathematical model, we generate data.
Using that data, we create a back propagation
39:48.930 --> 39:55.930
network. What we have done here is that we
have taken sampling time capital T is 0.01
39:57.839 --> 39:59.650
in the mathematical model.
39:59.650 --> 40:05.089
150 data have been generated using the system
equation. We saw that equation. This is the
40:05.089 --> 40:12.089
equation. Using this equation, we generated
150 data. What is u of t? Given u of t, what
40:13.549 --> 40:20.549
is h of t plus 1? This is our input that we
have selected and this is the corresponding
40:23.470 --> 40:30.470
output. In that equation, you set T equal
to 0.01 and g is the normal acceleration due
40:30.509 --> 40:37.509
to gravity, which is 9.81 meter per Second
Square. Once you give those parameters, here
40:40.059 --> 40:43.720
all the others are known.
40:43.720 --> 40:50.720
T is given as 0.01, you are giving g, so it
is simply u of t and h of t. You have given
40:51.670 --> 40:58.670
u of t and then use this particular equation.
We used the Runge–Kutta fourth order to
40:58.950 --> 41:05.950
generate the response. This response is generated
using Runge–Kutta fourth order equation.
41:07.329 --> 41:14.329
Actually, there is no need, this is a discrete
time recursive equation. If it is a differential
41:16.720 --> 41:21.720
equation, then we would have used the Runge–Kutta
equation but it is a simple recursive equation
41:21.720 --> 41:28.720
and there is no need for the Runge-Kutta equation.
We generated this data given this input and
41:33.069 --> 41:38.349
then we selected a network structure.
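The data-generation step can be sketched as below. The lecture does not spell out the full right-hand side of the surge tank equation, only that it involves the square root of h and u scaled by 1 upon the square root of h, so the function surge_tank_step here is a hypothetical stand-in with illustrative constants; T equal to 0.01 and g equal to 9.81 are from the lecture:

```python
import math

T = 0.01   # sampling time, as in the lecture
g = 9.81   # acceleration due to gravity, m/s^2

def surge_tank_step(h, u):
    # h(t+1) = h(t) + T * f(h(t), u(t)); f is a placeholder nonlinearity
    # chosen to involve sqrt(h) and u scaled by 1/sqrt(h), as described
    return h + T * (-math.sqrt(2.0 * g * h) + u) / math.sqrt(h)

def generate_data(u_seq, h0=1.0):
    # roll the recursive equation forward to build (u(t), h(t), h(t+1)) triples
    data, h = [], h0
    for u in u_seq:
        h_next = surge_tank_step(h, u)
        data.append((u, h, h_next))
        h = h_next
    return data

# 150 input/output data, as in the lecture (constant input, for illustration)
training_data = generate_data([5.0] * 150)
```

Since the equation is a discrete-time recursion, no Runge–Kutta integration is needed to roll it forward.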
41:38.349 --> 41:45.349
We took a back propagation network having
two hidden layers. Each hidden layer has 15
41:47.640 --> 41:54.640
neurons or computational units and the number
of outputs is one – the desired output. We represent
42:02.190 --> 42:09.190
data in this form, u of t and h desired t plus
1. We get this from the system equation. This
42:17.059 --> 42:24.059
is my desired output: h d t plus 1, given
u of t. But you can see here how we select the inputs.
42:29.880 --> 42:36.880
You see here in the back propagation network
feed forward network, if I give u of t, I
42:37.299 --> 42:42.109
cannot compute h of t plus 1. You can easily
see that h of t plus 1 is a function of h
42:42.109 --> 42:49.109
of t as well as u of t; h of t plus 1 is a
function of h of t and u of t. Obviously,
42:56.829 --> 43:03.829
I take this h of t from the system, because
at instant t, I am able to measure what is
43:12.670 --> 43:19.670
the liquid level in the surge tank; I am measuring
that and so I know the actual value of h d
43:19.940 --> 43:26.940
t; u of t is the flow rate that is going to
the surge tank, which I am able to measure.
43:29.700 --> 43:36.700
Given the actual liquid level of the surge
tank at time t and the flow rate u of t, I
43:38.259 --> 43:45.259
have to predict what h of t plus 1 is. This
is my model.
43:45.440 --> 43:52.440
Obviously, for my system, the inputs are u of
t and h d of t, which is written here; the subscript
43:58.259 --> 44:04.819
d here denotes the desired value, which has been
computed from the actual system. The number
44:04.819 --> 44:11.819
of inputs that I have is 2 and the output is my
h desired t plus 1. Given u of t and h d of t,
44:13.339 --> 44:20.339
I must predict what h d of t plus 1 is: that
is the objective. We have fixed the learning
44:21.130 --> 44:25.960
rate to be 0.2, the number of input/output data
pairs is 150, the activation function for hidden
44:25.960 --> 44:29.890
layers is sigmoidal and we have taken the
activation function of the output layer to
44:29.890 --> 44:36.890
be linear. You have two hidden layers with sigmoidal
activation and a linear output layer, but you can select
44:37.640 --> 44:40.430
your own architecture and do the same thing.
44:40.430 --> 44:47.430
In fact, this can be easily done using a single
hidden layer and output layer with sigmoidal
44:48.220 --> 44:55.220
activation function – we can easily do it.
What we do is that we have 150 training data,
44:56.519 --> 45:03.519
we selected network architecture; we observed
what the input to the network is and what
45:04.829 --> 45:11.829
the output is, and then we recursively update
the weights. In the
45:13.210 --> 45:18.880
beginning, all the weights in the network
were randomly initialized – we took very
45:18.880 --> 45:19.599
small values.
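A minimal sketch of this setup, assuming NumPy: two sigmoidal hidden layers of 15 units each, a linear output unit, learning rate 0.2, and small random initial weights. The training samples used below are a toy stand-in for the surge-tank data, not the lecture's actual data set.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BPNetwork:
    """2-15-15-1 feed forward network: sigmoidal hidden layers,
    linear output, trained by plain back propagation."""
    def __init__(self, sizes=(2, 15, 15, 1), seed=0):
        rng = np.random.default_rng(seed)
        # Small random initial weights, as in the lecture.
        self.W = [rng.normal(0.0, 0.1, (m, n))
                  for n, m in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(m) for m in sizes[1:]]

    def forward(self, x):
        self.acts = [np.asarray(x, dtype=float)]
        last = len(self.W) - 1
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = W @ self.acts[-1] + b
            self.acts.append(z if i == last else sigmoid(z))  # linear output layer
        return self.acts[-1]

    def train_step(self, x, y, eta=0.2):
        # One gradient-descent update on squared error for one sample.
        target = np.asarray(y, dtype=float)
        out = self.forward(x)
        delta = out - target                 # output-layer error signal
        err = float(0.5 * np.sum(delta ** 2))
        for i in reversed(range(len(self.W))):
            gW, gb = np.outer(delta, self.acts[i]), delta
            if i > 0:                        # back propagate through sigmoid layer
                a = self.acts[i]
                delta = (self.W[i].T @ delta) * a * (1.0 - a)
            self.W[i] -= eta * gW
            self.b[i] -= eta * gb
        return err
```

An epoch, in the lecture's sense, is one pass of `train_step` over all 150 samples.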
45:19.599 --> 45:26.599
You will see that the error has been reduced
to less than 0.004 after 20,000 epochs. What
45:34.999 --> 45:41.999
is an epoch? Each epoch is one pass over 150 samples;
that is, in my data set, I have 150 different sets of
45:45.819 --> 45:52.819
data. One data point is u of t, h d of t and h d of t plus
1. This is my single data set at t, and t is
46:00.880 --> 46:07.880
equal to 1 to 150. Like that, I have 150 data
sets. I give to my network u t and h d t,
46:09.660 --> 46:16.660
compute the network's estimate of h of t plus 1,
compare it with the desired h d of t plus 1, back propagate the error and
46:16.759 --> 46:23.420
update the weights; doing that, I am able to
reduce my cost function to less than 0.004.
46:23.420 --> 46:30.420
After training is over, what did we do? We
give to the network a different control input
46:37.420 --> 46:44.420
and compare it with the actual output, that
is, this control input was given to this equation
46:45.549 --> 46:52.059
and then we computed what the output is. This
is actual output based on this equation. We
46:52.059 --> 46:59.059
then gave it to the neural network that has
been trained. Once we gave this input to the
47:01.249 --> 47:07.220
neural network, the response and the actual
output based on the equation are actually
47:07.220 --> 47:12.589
matching – you can see that.
You can see that there is a red line here
47:12.589 --> 47:19.589
and there is a green line. The red line is
the desired based on the equation. Red is
47:22.900 --> 47:29.900
computed from the equation and green is from
the neural network. Both the red and green
47:33.859 --> 47:40.859
are very much following each other. In essence,
we can say that the actual system identification
47:42.890 --> 47:49.890
has been done; that is, the equation that
represents a model of a specific surge tank
47:52.339 --> 47:59.339
has been again represented using a back propagation
network having two hidden layers. Now, we
48:02.630 --> 48:07.809
will look into some other aspects of back
propagation.
48:07.809 --> 48:14.809
As I already told you, some of the problems
in back propagation are that the normal back
48:15.069 --> 48:20.819
propagation can only optimize properly if my cost function
has a single minimum. Normally, in a cost function,
48:20.819 --> 48:27.819
you will have many local minima as well as
a global minimum and so, I may reach here or
48:28.989 --> 48:35.450
I may reach here. If I start from here, I
will reach here; if my initial weights are
48:35.450 --> 48:40.369
here, then I will reach here, because that
is how gradient descent works; if my initial weights
48:40.369 --> 48:47.369
are here, I will reach here. This is the limitation
of back propagation – reaching global minimum
48:48.819 --> 48:54.720
in back propagation is not guaranteed.
48:54.720 --> 49:01.720
That is why researchers have introduced several
variations in the back propagation algorithm.
49:02.480 --> 49:07.999
First is to improve the convergence speed,
then to avoid local minima and to improve the generalizing
49:07.999 --> 49:12.839
capability. Generalizing capability means
that if I have trained my network for a specific
49:12.839 --> 49:19.220
data set, it should be able to predict what
the output is for a new data set; that is
49:19.220 --> 49:26.220
called generalization. One of the ways to
improve the convergence speed is by adding
49:27.359 --> 49:29.029
a momentum.
49:29.029 --> 49:35.529
How do we add a momentum? This is our equation.
It is actually a heuristic approach, but we
49:35.529 --> 49:41.970
can analyze this equation just immediately.
This is our normal equation: w t plus 1 is
49:41.970 --> 49:46.809
w of t minus eta into del E by del w of t
– this is our normal gradient descent. Then,
49:46.809 --> 49:52.989
I have added a momentum term, which is alpha
into w of t minus w of t minus 1. When I add
49:52.989 --> 49:58.249
this, this is heuristic and there is no derivation
here – we have not derived this particular
49:58.249 --> 50:00.220
term; this is heuristic.
50:00.220 --> 50:06.259
Let us see what happens. Let us objectively analyze
this equation on a flat surface. What is a
50:06.259 --> 50:13.259
flat surface? In a flat surface, w of t minus
w of t minus 1 is the same as w of t plus
50:19.269 --> 50:26.269
1 minus w of t; this is called a flat surface.
Say for example, my error curve is sometimes
50:31.430 --> 50:38.430
like this. I am moving like this. This is
a flat error surface. If I am here, I do not
50:41.420 --> 50:47.670
want to stay on the flat surface; I want to
come down to the valley, I must go towards
50:47.670 --> 50:54.670
the valley, and so my speed should be very
fast in this zone.
50:54.880 --> 51:00.880
In that case, w of t plus 1 minus w of t is
the same as w of t minus w of t minus 1, which
51:00.880 --> 51:07.880
is delta w. If that is the case, I can rewrite this
equation. If I rewrite this equation,
51:10.140 --> 51:17.140
this becomes 1 minus alpha into delta w, because
w of t minus w of t minus 1 is delta w and
51:18.730 --> 51:25.730
w of t plus 1 minus w of t is also the same delta w.
51:26.690 --> 51:33.690
So, 1 minus alpha into delta w is minus eta into
del E upon del w of t.
51:41.099 --> 51:47.619
Look at this equation. I can write down what
is my weight update deltaw, the increment
51:47.619 --> 51:54.619
in weight. It is minus eta upon 1 minus alpha
into del E by del w of t. Normally, delta w is
52:00.079 --> 52:06.509
minus eta into del E upon del w of t, but utilizing this
heuristic term, I am able to increase the
52:06.509 --> 52:13.509
learning rate by a term called eta by 1 minus
alpha and alpha is less than 1. Obviously,
52:14.700 --> 52:21.700
if I say alpha is 0.5, then 1 minus 0.5 is
0.5 and the effective increase in the learning
52:24.720 --> 52:31.720
rate is twice. If I make alpha 0.9, then 1
minus 0.9 is 0.1 and so, the effective eta
52:32.749 --> 52:39.239
is now 10 times the actual eta. The learning
rate is thus increased by a factor of 1 upon 1
52:39.239 --> 52:45.420
minus alpha on a flat error surface. This
is how we can increase the convergence speed
52:45.420 --> 52:46.759
and this is an example.
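The flat-surface argument can be checked numerically. This is a toy sketch: a constant-gradient (flat-slope) surface with eta = 0.2 and alpha = 0.5, both values chosen here only for illustration.

```python
def momentum_step(w, w_prev, grad, eta=0.2, alpha=0.5):
    # w(t+1) = w(t) - eta * dE/dw(t) + alpha * (w(t) - w(t-1))
    return w - eta * grad + alpha * (w - w_prev)

# On a constant-gradient surface, the weight increment settles at
# -eta / (1 - alpha) * grad, i.e. the learning rate is effectively
# multiplied by 1 / (1 - alpha): here 0.2 / (1 - 0.5) = 0.4.
grad = 1.0
w, w_prev = 0.0, 0.0
for _ in range(100):
    w, w_prev = momentum_step(w, w_prev, grad), w
effective_step = w - w_prev  # converges to -0.4
```

With alpha = 0.9 the same loop would settle at a step ten times the plain gradient step, matching the lecture's arithmetic.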
52:46.759 --> 52:53.759
This is an XOR network. In the XOR network,
we have taken eta equal to 0.5; this is the
52:54.009 --> 53:01.009
learning rate, and alpha, the momentum rate,
is 0.8. You can easily see here that when
53:01.589 --> 53:07.200
we use the simple gradient descent, we do
not use the momentum. The number of iterations
53:07.200 --> 53:14.200
is 4200. I take 4200 iterations to converge,
that is, my RMS error is less than 0.0005.
53:19.410 --> 53:26.410
For reaching the same termination condition
with momentum, I need only 2200 – almost half
53:31.069 --> 53:38.069
of the iterations. By adding a momentum, I could
reduce the number of training iterations by half – this
53:42.160 --> 53:49.160
is what we call speed of convergence. The convergence
speed is almost doubled using this momentum
53:51.970 --> 53:56.730
term. It is not that adding a momentum always
increases the convergence speed, but it
53:56.730 --> 54:02.200
works in some cases; this is heuristic; this
is not always true.
54:02.200 --> 54:09.019
Similarly, for adaptive learning rate, we
will have a special class. For today, I will
54:09.019 --> 54:16.019
just give you a simple note on how we implement
adaptive learning rate in a heuristic manner
54:16.869 --> 54:23.319
- a heuristic approach to adaptive learning
rule. You see that in back propagation, eta
54:23.319 --> 54:30.319
is a fixed value. Once eta is fixed, most
likely, you will end up in a local minimum, but if
54:31.309 --> 54:35.880
I vary this eta in a very intelligent manner
in such a way that it can avoid the local
54:35.880 --> 54:41.180
minimum, this is called adaptive learning
rate. Can I do that?
54:41.180 --> 54:47.910
Here is a heuristic approach for which we
do not have a theoretical basis. Based on
54:47.910 --> 54:54.910
certain intuition, we derive this algorithm.
What this particular heuristic algorithm says is
54:56.200 --> 55:01.039
that a different learning rate is assigned
to each adjustable parameter; that is, every
55:01.039 --> 55:07.049
typical wij is associated with a different
eta. Each learning rate parameter should be
55:07.049 --> 55:12.200
allowed to vary from iteration to iteration.
When the derivative of the cost function with
55:12.200 --> 55:18.910
respect to weight has the same algebraic sign
for several consecutive iterations, the learning
55:18.910 --> 55:25.479
rate parameter for that weight should be increased.
When the algebraic sign of the derivative
55:25.479 --> 55:32.479
alternates consecutively for several iterations,
the learning rate parameter should be reduced.
55:33.910 --> 55:40.910
What I am talking about now is that we always
compute del E by del wij, where wij is the
55:42.400 --> 55:49.400
typical weight. If this derivative is constantly
positive, then for that particular wij, I
55:52.579 --> 55:58.470
increase the eta. If this derivative alternates,
that is, sometimes positive, then negative,
55:58.470 --> 56:03.359
positive, negative, positive, negative, then
what I do is I decrease this eta; this
56:03.359 --> 56:06.640
is the heuristic we introduce.
56:06.640 --> 56:13.640
Based on that heuristic, this is my adaptive
learning rate: eta i p of t is equal to either mu into eta i
56:17.470 --> 56:24.470
p of t minus 1 or d into eta i p of t minus 1, where
mu and d are the increase and decrease factor,
56:32.269 --> 56:39.269
respectively. I either increase the previous eta by mu
or decrease it by d. These are
56:41.059 --> 56:48.059
multiplication factors. So I either increase
it or decrease it.
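The rule just stated can be sketched per weight. This is one hedged reading of the lecture's heuristic, collapsing the eta i p indexing to a single weight; mu = 1.1 and d = 0.7 are the values the lecture quotes next.

```python
def adapt_rate(eta_prev, grad, grad_prev, mu=1.1, d=0.7):
    # Increase eta when dE/dw keeps the same algebraic sign over
    # consecutive iterations; decrease it when the sign alternates.
    if grad * grad_prev > 0:
        return mu * eta_prev   # same sign: eta grows by the factor mu
    if grad * grad_prev < 0:
        return d * eta_prev    # alternating sign: eta shrinks by d
    return eta_prev            # zero gradient: leave eta unchanged

eta = 0.5
eta = adapt_rate(eta, grad=1.0, grad_prev=2.0)   # same sign, eta increased
eta = adapt_rate(eta, grad=-1.0, grad_prev=1.0)  # sign flip, eta decreased
```

In a full implementation each weight w i j would carry its own eta, updated from its own gradient history every iteration.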
56:50.319 --> 56:54.869
You see that the increase factor means it
has to be greater than 1 and decrease factor
56:54.869 --> 57:01.869
means it has to be less than 1. We have
taken mu to be 1.1 and d to be 0.7. We have taken
57:02.140 --> 57:09.140
the same XOR function. Now, using the adaptive
learning rate, training is done until the
57:10.900 --> 57:17.789
root mean square error is less than 0.0005
for the XOR gate.
57:17.789 --> 57:24.789
We finally see that without adaptive rate,
as we have already seen, the simple gradient
57:26.789 --> 57:33.789
descent gives you 4200 iterations or weight
updates – I have to update the
57:39.640 --> 57:46.640
weights that many times; I have to update
the weights 4200 times, whereas using the
57:49.799 --> 57:56.799
heuristic adaptive learning rule, we could
reduce that to 1300 while reaching the same
57:58.930 --> 58:05.930
root mean square error.
58:06.539 --> 58:13.539
Finally, to conclude, I would say what we
discussed today. We reviewed what back propagation
58:17.400 --> 58:24.400
is, we talked about the generalized delta rule and
then we talked about system identification
58:39.869 --> 58:46.869
of a surge tank. Then, we talked about adding
a momentum –
this is to increase the convergence speed
59:17.559 --> 59:24.559
as well as avoid local minimum. We also talked
about a heuristic adaptive learning rule.
59:33.460 --> 59:40.460
We will take a special class on adaptive learning
rule in detail - comprehensively how we can
59:41.619 --> 59:45.960
compute this adaptive learning rule having
a theoretical basis; probably after two or
59:45.960 --> 59:52.960
three classes, but this is the summary of
this class. Thank you very much.