WEBVTT
00:20.949 --> 00:27.949
This is lecture 9 of module I – neural networks
in this course on intelligent control. In
00:34.390 --> 00:41.390
this course, today, we will cover what a recurrent
network is and the unfolding in time of a
00:44.850 --> 00:51.850
recurrent network. What is a recurrent network?
A feed-forward network without feedback connection,
00:54.769 --> 01:01.769
recurrent network with feedback connection.
Unfolding in time, that is a recurrent network,
01:03.199 --> 01:10.199
when unfolded in time, can look like a feed-forward
network. Then, back propagation through time.
01:13.299 --> 01:20.299
This is the learning algorithm for recurrent
network. What is a recurrent network?
01:26.460 --> 01:33.460
Earlier, we said that if there is a neuron
and there is another neuron, if I connect
01:38.719 --> 01:45.719
this neuron to this neuron, then this connection
was not allowed in feed-forward network, whereas
01:51.549 --> 01:58.549
in a recurrent network, we allow not only
the feedback connection but also self-feedback.
02:04.929 --> 02:11.929
This is called recurrent network. When we
have two neurons in any feed-forward network
02:19.880 --> 02:26.880
– multilayer network or radial basis function
network, if I connect in a forward direction,
02:28.770 --> 02:33.620
the backward direction connection is not allowed,
whereas in recurrent network, not only the
02:33.620 --> 02:40.620
forward and feedback connections are allowed
but also self-feedback. This is known as self-feedback.
02:47.890 --> 02:52.379
This kind of structure is a recurrent network.
02:52.379 --> 02:59.379
Recurrent networks include feedback connections
in their architecture. What are the characteristics
02:59.610 --> 03:06.610
of a recurrent network? Representation of
time: sequence is important. I would like
03:10.980 --> 03:16.739
to explain this particular thing in detail.
03:16.739 --> 03:23.739
What is the meaning of sequence? Imagine we
have five patterns – 1 0 1, 1 1 1, 0 0 1,
03:33.159 --> 03:40.159
1 0 0, and 1 1 0. The meaning of sequence
is that this pattern follows this pattern,
03:54.049 --> 03:59.459
this pattern follows this pattern, this pattern
follows this pattern, and this pattern follows
03:59.459 --> 04:06.459
this pattern. Imagine there is a display board
and I want to display this pattern first,
04:07.640 --> 04:14.640
this pattern second, this pattern third, fourth,
fifth and rotate it clockwise. In that sense,
04:15.349 --> 04:22.349
the sequence is important here. In this case,
sequence is important.
04:37.070 --> 04:44.070
Another example of sequence is
the time series prediction, any kind of time
series prediction. Let us say
rainfall in a year or in a decade or in a
05:11.780 --> 05:18.780
century. Let us say rainfall month-wise in
a year – January, February and so on up
05:21.400 --> 05:28.400
to December. The pattern of rainfall in
a year is a sequence. This is say in mm. So,
05:32.850 --> 05:39.850
some mm in January, some mm in February, some
mm in December and maybe in September, it
05:44.000 --> 05:51.000
is a maximum in a particular place. When we
are dealing with patterns in which time is
05:56.370 --> 06:03.370
intrinsic, these are called temporal patterns,
that is, in a year, I want to see how the
06:14.410 --> 06:21.410
pattern of rainfall is. There is a time involved
in this. This particular notion of sequence
06:27.220 --> 06:34.220
is important, representation of time: sequence
is important, I can explain this particular
06:35.360 --> 06:42.080
concept in another way.
06:42.080 --> 06:49.080
For example, I have a data pattern, this is
a data set. In this data set 1, 2, 3, 4, 5,
06:58.190 --> 07:05.190
6, the data, which are stored in this particular
stack, are in a sequence. Data are stored
07:13.990 --> 07:20.990
in a sequence. If data is stored in a sequence,
then when I have to represent this data to
07:25.830 --> 07:32.830
a network, I can only represent them in a
sequence. In a multilayer network, if I have
07:35.040 --> 07:42.040
a data set, I can always select any of these
data randomly and present to the network because
07:44.040 --> 07:50.170
the sequence is not important. I will explain
this point in detail in a system identification
07:50.170 --> 07:57.170
example because this is an intelligent control
course. I will explain using a relevant example
07:57.250 --> 08:00.540
in detail.
08:00.540 --> 08:07.540
Characteristic is representation of time where
sequence is important. The other thing is
08:11.140 --> 08:18.140
rich temporal and spatial behaviors, that
is, this network is dynamically very rich.
08:22.990 --> 08:29.990
It has stable and unstable fixed points, it exhibits
limit cycles, chaotic behaviors, and it has
08:31.440 --> 08:38.440
many other rich dynamic behaviors that any
dynamic nonlinear system exhibits, which a
08:40.320 --> 08:47.320
static network like a multilayer network or
radial basis function network does not manifest
08:50.370 --> 08:57.370
(these networks do not manifest rich dynamical
characteristics). The applications of recurrent
09:01.840 --> 09:08.170
networks are of course system identification
and control, associative memory, time series
09:08.170 --> 09:15.170
prediction, temporal pattern recognition:
signal classification and speech recognition.
09:16.510 --> 09:20.870
Signal classification and speech recognition
are temporal pattern recognition tasks.
09:20.870 --> 09:27.740
Now, we will talk about the architecture of
a recurrent network. These two architectures
09:27.740 --> 09:34.740
are partial recurrent networks. Partial means
the main body of the network is like a multilayer
09:39.290 --> 09:46.290
network, where there are feed-forward structures
layer-wise, but at the input, you can see
09:46.810 --> 09:53.810
that there is a temporal sequence. The input
to the network is x t, x t minus 1, this one
09:56.430 --> 10:03.430
is x t minus 1, this is one delay further,
x t minus 2 until x t minus n. This is normally
10:06.450 --> 10:13.450
known as time delay network.
10:15.380 --> 10:22.380
Imagine we are trying to model a function
like y t plus 1 is function of x t, x t minus
10:30.050 --> 10:37.050
1 and so on. When this is the dynamic model,
then this kind of network is selected. In
10:45.860 --> 10:52.860
many system identification examples, another
kind of network is used where you see that
10:54.140 --> 11:00.770
this is a multilayer network structure – main
body, but then, the input to the multilayer
11:00.770 --> 11:07.770
network in one sequence is the delayed version
of x t. So there is x t, x t minus 1 up to
11:08.520 --> 11:15.520
x t minus n. Similarly, this is a delay, so
the response is y t up to y t minus m. Why
11:25.600 --> 11:32.600
do we say partial recurrent? It is because
the main body of the network resembles a multilayer
11:34.020 --> 11:41.020
network, the feed-forward network. This kind
of network has been used particularly in system
11:44.110 --> 11:51.110
identification and time series prediction.
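The tapped-delay inputs of such a time delay network can be sketched as follows; the function and argument names here are illustrative, not from the lecture. Each input pattern collects x t down to x t minus n and y t down to y t minus m, with y t plus 1 as the target.

```python
def make_tapped_delay_patterns(x, y, n, m):
    """Build input/target pairs for a time delay (partial recurrent) network.

    Each input pattern is [x(t), ..., x(t-n), y(t), ..., y(t-m)] and the
    target is y(t+1). Names are illustrative sketch choices.
    """
    start = max(n, m)
    inputs, targets = [], []
    for t in range(start, len(x) - 1):
        row = [x[t - k] for k in range(n + 1)] + [y[t - k] for k in range(m + 1)]
        inputs.append(row)
        targets.append(y[t + 1])
    return inputs, targets
```

The main body of the network stays a plain multilayer network; only its input layer sees the delayed sequence.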
11:51.810 --> 11:58.810
Here, these are fully recurrent networks.
Earlier, we saw a partially recurrent network
12:00.590 --> 12:07.590
but this is a fully recurrent network. In
a fully recurrent network, you see that the
12:12.240 --> 12:19.240
input to this neuron is coming from this neuron
as well as this neuron. We can easily see that
12:20.600 --> 12:27.600
this one is given here and again, this one
is given here. The input to this neuron is
12:31.650 --> 12:38.490
coming from the other two neurons. Similarly,
this neuron's inputs are coming from this
12:38.490 --> 12:45.490
neuron and this neuron. This is this neuron
and this is the other neuron and similarly,
12:47.470 --> 12:54.470
this neuron, which is this one, and this neuron,
which is this one….
12:57.100 --> 13:03.400
In that sense, this is a fully recurrent network.
Each neuron is connected to all other neurons
13:03.400 --> 13:10.400
in the network. Normally, Hopfield network
has this structure. There is a little mistake
13:20.550 --> 13:27.550
here. The branches are missing. This is another
fully recurrent network. This network and
13:33.089 --> 13:40.089
this network are almost the same. The only
difference between this network and this network
13:47.270 --> 13:54.270
is that there is a self-feedback. The output
from this network is again fed back to the
13:56.930 --> 14:03.930
same unit. These two architectures are called
fully recurrent. Each neuron is connected
14:05.210 --> 14:12.210
to all other neurons. One is without self-feedback
and the other is with self-feedback. But we can
14:12.760 --> 14:18.440
have various…. These are very simplistic
models that I am introducing to you. There
14:18.440 --> 14:23.670
are also many other complex models – we
will not be discussing them.
14:23.670 --> 14:30.180
Once you have an architecture, obviously,
you would like to use it. As you know, once
14:30.180 --> 14:37.180
we talk about neural network, it means the
weights have to be learned, they have to be
14:38.890 --> 14:44.170
trained. There are various learning algorithms
but what we will mainly focus in this class
14:44.170 --> 14:50.170
and in this course are back propagation through
time, another is real-time recurrent learning.
14:50.170 --> 14:57.170
There are others also: extended Kalman filtering
and many other learning algorithms that are
14:58.410 --> 15:03.270
there in the literature. We will not be taking
up those things now.
15:03.270 --> 15:10.270
Let us take a discrete dynamic equation, which
is y t plus 1 is f of y t plus g of y t into
15:28.920 --> 15:35.920
u t, where y and u are output and input, respectively,
and f and g are arbitrary nonlinear functions.
16:12.589 --> 16:19.589
Let us say that this particular dynamic that
we have written here y t plus 1 is f y t plus
16:20.320 --> 16:27.320
g y t u t represents some nonlinear system.
Now, I want to model using a feed-forward
16:35.190 --> 16:42.190
network as well as the recurrent network the
dynamics of this equation, let us say this
16:48.670 --> 16:55.670
equation. If I want to use a feed-forward
network, what I will do is….
16:56.640 --> 17:03.640
In feed-forward network…. This is called
a feed-forward network. My input will be y
17:16.480 --> 17:23.480
t as well as u t and output is y t plus 1.
This is actually predicted output of the network
17:30.259 --> 17:37.259
and y t is actually the input to the network,
but is obtained from the actual system. If
17:45.720 --> 17:52.720
I really want to capture this particular dynamics
that we are looking at y t plus 1 is f y t
17:52.929 --> 17:59.929
plus g y t into u t using a feed-forward network,
what I do is that my inputs are y t as well
18:03.820 --> 18:10.679
as u t and the output is y t plus 1 (this
is the predicted one), whereas this y t is
18:10.679 --> 18:14.210
taken from the actual system.
18:14.210 --> 18:21.210
I observe the state of the system at t, y
t and give an input u at t. Then, my output
18:28.759 --> 18:35.759
y t plus 1 is predicted by the feed-forward
network, but a recurrent network, on the contrary,
18:40.549 --> 18:47.549
only takes the input u t and the output is
y t plus 1, that is, it is possible because
18:57.239 --> 19:04.239
a recurrent network allows feedback connection
from output to input, the internal states.
19:07.090 --> 19:14.090
So, a recurrent network is much like a
19:23.139 --> 19:30.139
dynamic system, whereas a feed-forward network
is a static system. This is very important
19:35.869 --> 19:40.879
to understand – a feed-forward network is
a static system, because….
19:40.879 --> 19:47.350
While training a feed-forward network, we
really do not worry; we really do not care
19:47.350 --> 19:54.350
about the order of the pattern. Given a pattern,
that is, input/output data set, I can select
19:55.190 --> 20:02.190
any pattern, give the input of that pattern
to the network, forward pass that input, find
20:02.350 --> 20:07.279
what is the output, compare with the desired
one, back propagate the error, update the
20:07.279 --> 20:13.440
weight – that is the feed-forward network,
whereas, in a recurrent network, the sequence
20:13.440 --> 20:19.899
of data to be presented in the network must
be preserved.
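The feed-forward identification setup just described can be sketched as follows: simulate a plant of the form y t plus 1 is f of y t plus g of y t into u t (with illustrative, assumed choices of f and g, not from the lecture) and collect the y t, u t to y t plus 1 pairs that a static feed-forward network would train on in any order.

```python
import math
import random

def simulate_plant(u_seq, y0, f, g):
    """Simulate the plant y(t+1) = f(y(t)) + g(y(t))*u(t) and return
    the full state sequence [y(0), ..., y(T)]."""
    ys = [y0]
    for u in u_seq:
        ys.append(f(ys[-1]) + g(ys[-1]) * u)
    return ys

# Illustrative (assumed) nonlinearities, not from the lecture:
f = lambda y: 0.5 * math.sin(y)
g = lambda y: 1.0 + 0.1 * y * y

random.seed(0)  # illustrative seed
u_seq = [random.random() for _ in range(5)]
ys = simulate_plant(u_seq, 0.0, f, g)

# Feed-forward training pairs: input (y(t), u(t)), target y(t+1).
# For a static network these pairs may be presented in any order,
# whereas a recurrent network must see the sequence in order.
pairs = [((ys[t], u_seq[t]), ys[t + 1]) for t in range(len(u_seq))]
```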
20:19.899 --> 20:26.899
How will I present data to this recurrent
network? u 0 y 1, then u 1 y 2, so the sequence
20:30.049 --> 20:37.049
has to be like this: u 0 y 1 first, then u
1 y 2 second and so on, but I cannot arbitrarily
20:47.769 --> 20:54.769
select any of these sequences and present
to the recurrent network. There is a representation
20:56.609 --> 21:03.609
of time in a recurrent network, the sequence
is important, whereas time does not play any
21:06.570 --> 21:10.929
relationship in feed-forward network – it
is a static map.
21:10.929 --> 21:15.450
The normal training algorithms that are used
for a recurrent network are back propagation
21:15.450 --> 21:22.450
through time and real-time recurrent learning.
These two are very popular algorithms available
21:24.399 --> 21:31.019
in the literature, but there are also others,
like extended Kalman filtering, but in this
21:31.019 --> 21:37.489
course, we will only learn or I will only
take up these two learning algorithms – back
21:37.489 --> 21:41.820
propagation through time, which is offline,
and real-time recurrent learning, which is
21:41.820 --> 21:48.820
an online learning algorithm. Today, this
lecture will focus on back propagation through
21:51.309 --> 21:58.179
time. In back propagation through time, we
unfold the recurrent network in time to reach
21:58.179 --> 22:00.909
a feed-forward network architecture.
22:00.909 --> 22:07.909
What is that unfolding? The unfolding is….
Let us take a simple recurrent network that
22:21.899 --> 22:28.899
consists of only a single neuron, the input
is x t, the input to the neuron – the weight
22:31.639 --> 22:38.639
is w; there is a self-feedback connection
from the output, and its weight is g.
22:40.230 --> 22:47.230
This is a very simple network, because it
consists of only one neuron. The first part
22:48.289 --> 22:54.919
is that before we talk about unfolding this
network in time…. This is the feedback structure
22:54.919 --> 23:01.919
and the objective is how to convert this feedback
structure to a feed-forward structure. We
23:17.320 --> 23:23.460
want to do this. I want to convert this feedback
structure to a feed-forward structure. How
23:23.460 --> 23:24.029
do I do it?
23:24.029 --> 23:31.029
It is very easy to do. Before we can do that,
we must write down the forward response of
23:33.379 --> 23:40.379
the network. What is the meaning of forward
response? Given x t, what is y t plus 1? Let
23:44.539 --> 23:51.539
us say this neuron has sigmoidal activation
function, that is, y t plus 1 is w x t plus
24:03.679 --> 24:10.679
g y t. You can easily see that this is the
feedback, so we assume that when I do not
24:14.809 --> 24:21.809
put anything on this connection, it means
that there is a delay of 1 unit. So, the y
24:24.029 --> 24:31.029
at the t plus 1 sampling instant is a function
of the summation of all the inputs
24:35.379 --> 24:42.379
– the w into x t plus g into y t. This is
the total summation and the summed input at
24:46.340 --> 24:53.340
the input of this neuron and then sigmoidally
activated, where f z is as usual… the sigmoidal
24:57.720 --> 25:04.720
activation, 1 upon 1 plus e to the power of minus z. This
is our simplest network and to this simplest
25:06.979 --> 25:12.409
network, how do you unfold this network?
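The forward response just described can be sketched in a few lines; the function names are illustrative. Each step applies the sigmoid to w into x t plus g into y t, where g carries the one-unit-delayed output back to the input.

```python
import math

def sigmoid(z):
    """f(z) = 1 / (1 + e^(-z)), the sigmoidal activation."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x_seq, y0, w, g):
    """Forward response y(t+1) = f(w*x(t) + g*y(t)) of the one-neuron
    recurrent network; w is the input weight, g the self-feedback weight."""
    y = y0
    ys = []
    for x in x_seq:
        y = sigmoid(w * x + g * y)
        ys.append(y)
    return ys
```

Iterating this loop is exactly the recursion y 1, y 2, y 3 and so on used next to unfold the network.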
25:12.409 --> 25:19.409
Let us see this forward response that we wrote,
which is y t plus 1 is f of w x t plus g y
25:28.169 --> 25:35.169
t. This is the forward response of the simplest
network that we took. If I write recursively,
25:37.169 --> 25:44.169
what do I do? I write y 1 is f of w x 0 plus
g y 0. Similarly, y 2 is f of w x 1 plus g
25:57.389 --> 26:04.389
y 1. Look at this equation. I simply put t
= 0, then I get the first equation; if t = 1,
26:08.039 --> 26:15.039
I get the second equation; if t = 2, then
I will get the third equation and so on. We
26:24.700 --> 26:31.700
can go on like this. How do we construct a
feed-forward network looking at these equations?
26:33.570 --> 26:40.570
What I will do is in my network, I create
a node. In every node, I compute what is y
26:41.869 --> 26:48.869
1 as a function of x 0 y 0, y 2 as a function
of x 1 y 1 and y 3 as a function of x 2 and
26:52.549 --> 26:57.080
y 2. This is what is here.
26:57.080 --> 27:04.080
This is what we have given. Here, you see
that y 1 is w x 0 plus g y 0. Similarly, y
27:08.539 --> 27:15.539
2 is w x 1 plus g y 1 and sigmoidally activated.
This neuron is sigmoidally activated. Thus,
27:19.359 --> 27:26.359
if you look at our recursive equations here,
f of this summation…. . That means you should
27:31.279 --> 27:38.279
not forget what is the meaning of f. f is
always a sigmoidal activated function. That
27:38.859 --> 27:41.629
is what we are doing here.
27:41.629 --> 27:48.629
y 1 is f of w x 0 plus g y 0, y 2 is f of
w x 1 plus g y 1 and so on. This is called rolling
27:54.460 --> 28:01.460
the network in time. Now, if you look at this
network that I discussed now, if rolled over
28:02.379 --> 28:09.379
t = 5, then for t = 1, this is the first expansion,
for t = 2, this is the second expansion. So,
28:10.690 --> 28:16.889
the first expansion y 1 in terms of y 0 x
0, the second expansion y 2 in terms of x
28:16.889 --> 28:23.889
1 y 1, third y 3 in terms of x 2 y 2, similarly
fourth y 4 in terms of x 3, y 3 and fifth y
28:26.830 --> 28:33.830
5 in terms of x 4 y 4. This is how the network
is unfolding over time and we can go ahead
28:39.129 --> 28:44.369
depending on the situation or demand.
28:44.369 --> 28:51.369
Once we unfold, now this is a feed-forward
network. If you look at this, this is a feed-forward
28:53.029 --> 29:00.029
network. This feed-forward network is actually
MLN – multilayer network. With how many
29:08.279 --> 29:14.609
layers? I can say this is one layer, this
is the first layer – the first layer of
29:14.609 --> 29:19.090
weight, this is the second layer of weight,
this is the third layer of weight, fourth
29:19.090 --> 29:26.090
layer and fifth layer. So, this is an MLN
with five layers of weights. Given this structure,
29:39.210 --> 29:44.129
once I convert the recurrent network into
a feed-forward network, the objective would
29:44.129 --> 29:51.129
be now to apply… because this is now a feed-forward
network, so I apply the back propagation through
29:51.749 --> 29:58.749
time – standard back propagation, generalized
delta rule to update the weights.
29:59.590 --> 30:06.590
How do I do it? Now I already have the dynamic
equation here: y t plus 1 is f w x t plus
30:11.169 --> 30:18.169
g y t. This is the dynamic equation. What
I do is that given y 0 and the sequence of
30:25.429 --> 30:31.499
x 0, x 1.… These are all external input
– x 0, x 1, x 2, x 3, x 4 (these are all
30:31.499 --> 30:38.499
external inputs). Given this external input
and initial condition, which is y 0, let us
30:39.470 --> 30:46.470
compute what is y 1, y 2, y 3, y 4 and y 5.
This is the first step. The first step is
30:52.570 --> 30:59.570
given the sequence x 0, x 1 and so on up to
x 4 and initial condition… this is the initial
31:13.330 --> 31:20.330
condition
and these are external inputs, compute y 1,
y 2 … y 5. This is how we generate the data.
31:41.179 --> 31:48.179
Once we generate the data, what I can do is
that I now forward propagate my signal. Initial
31:52.190 --> 31:59.190
condition is y 0. Initially, w and g are
random values, but the ws in all the layers
32:00.499 --> 32:07.499
are the same random value. Similarly,
32:07.769 --> 32:14.769
gs are all random. Then, what I do is I take
the initial condition y 0 and x 0 is the input
32:19.830 --> 32:26.830
and compute what is y1; take x 1, compute
y 2; take x 2, compute y 3; take x 3, compute
32:28.159 --> 32:35.159
y 4; take x 4, compute y 5. Once given x 0,
x 1, x 2, x 3, x 4 and given the initial values
32:40.299 --> 32:46.539
of w and g…. These w and g values are all
the same in the initial network; actually,
32:46.539 --> 32:53.539
they are the same always, because for all
layers, w and g are the same at any given instant;
32:56.200 --> 32:57.960
they are not different.
32:57.960 --> 33:04.960
Given y 0 and this sequence we can compute
y 1, y 2, y 3, y 4, y 5. From this model,
33:11.279 --> 33:18.279
we get the actual y 5, y 4, y 3, y 2 and y
1. Obviously, we can compute at every target
33:19.929 --> 33:26.929
– in this target e 1, in this target e 2,
this target e 3, this target e 4 and this
33:29.219 --> 33:36.219
target e 5. What is e 5? The response of the
network, y 5, is subtracted from the desired
33:38.379 --> 33:45.379
response of the system. This e 5 is
y d 5 minus y 5, e 4 is y d 4 or y desired
33:49.599 --> 33:56.599
4 minus y 4 and so on. Similarly, y desired
3 minus y 3 is e 3. We compute these errors
33:56.940 --> 34:03.940
at every target. Now, apply the back propagation
through time. How do I do it? Let me explain
34:11.760 --> 34:13.940
in this manner.
34:13.940 --> 34:20.940
Let us take a very simple thing, that is,
only two layers. This is my x 0, y 0, this
34:33.839 --> 34:40.839
is y1, this is y2 and x1. Obviously, this
is w, this is w, this is g and this is g.
34:51.460 --> 34:58.460
What is y2? y2 is f of w x1 plus g y1. How
do I update this w and this g? This is my
35:01.859 --> 35:08.859
second layer. I will write delta w 2, the
change in w in the second layer, using the gradient descent
35:12.779 --> 35:19.779
back propagation algorithm. How do I write
this? This delta w 2 is eta delta… this
35:20.599 --> 35:27.559
is the output layer, so I put delta2 – the
error back propagated from the second layer,
35:27.559 --> 35:34.559
output layer and input is x 1. This is the
standard delta rule. Similarly, delta g in
35:56.049 --> 35:57.150
the second layer….
35:57.150 --> 36:04.150
I say this is the second layer and this is
the first layer. In the second layer, delta
36:10.619 --> 36:17.619
g due to the error from here is eta delta2
and input is y 1. What is delta2? delta2 is
36:25.630 --> 36:32.630
obviously the error here, which is e2 into
y2 into 1 minus y2. What is y2 into 1 minus
36:39.020 --> 36:46.020
y2? We assume that f dash, which is df by
dz, is y2 into 1 minus y2, when f z is the sigmoidal
37:02.819 --> 37:09.819
activation function. We have earlier mentioned
this point that if f z is 1 upon 1 plus e
37:09.819 --> 37:16.819
to the power of minus z, then f dash is f z
into 1 minus f z. That is what we are writing.
37:20.329 --> 37:27.329
What is e2? e2 is y2 desired minus y2. That
was for the second layer. Similarly, delta w
37:31.789 --> 37:38.599
will again be computed in the first layer.
37:38.599 --> 37:45.599
We can again do this for the first layer now.
delta w 1 is eta delta1 x0 and delta g 1 is
38:02.470 --> 38:09.470
eta delta1 y0, because this is a standard
delta rule. What is delta1? delta1 is obviously
38:18.529 --> 38:23.710
not only error back propagated from here,
there is also error here, and that error is
38:23.710 --> 38:30.710
e1, the error here and the back propagated
error is delta2, back propagated through g
38:40.400 --> 38:47.400
– that is the delta error back propagated
and this is your y1 into 1 minus y1. So, this
38:56.059 --> 38:59.029
represents delta1.
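The two-layer computation above can be sketched directly from these equations, assuming the sigmoid activation and the shared weights w and g; the function name is illustrative. Note how delta1 combines the local error e1 with delta2 back propagated through g, and how the per-layer changes are summed before updating the shared weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bptt_two_step(x0, y0, x1, y1d, y2d, w, g, eta):
    """One BPTT pass over the two-step unrolled one-neuron network.

    Forward: y1 = f(w*x0 + g*y0), y2 = f(w*x1 + g*y1).
    Deltas follow the generalized delta rule from the lecture; the shared
    weights are updated with the summed per-layer changes.
    """
    # forward pass
    y1 = sigmoid(w * x0 + g * y0)
    y2 = sigmoid(w * x1 + g * y1)
    # errors at each target
    e2 = y2d - y2
    e1 = y1d - y1
    # backward pass: delta2 at the output, delta1 includes the error
    # back propagated from layer 2 through the feedback weight g
    delta2 = e2 * y2 * (1.0 - y2)
    delta1 = (e1 + delta2 * g) * y1 * (1.0 - y1)
    # summed changes for the shared weights w and g
    dw = eta * (delta2 * x1 + delta1 * x0)
    dg = eta * (delta2 * y1 + delta1 * y0)
    return w + dw, g + dg
```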
38:59.029 --> 39:06.029
Like that, using simple delta rule, given
the data sequence, we can compute what is
39:10.420 --> 39:17.420
delta w 1, delta g 1, delta w 2, delta g 2
and finally add the changes in weight in every
39:23.760 --> 39:30.760
layer and sum that, and finally, w is updated
as w new is w old plus sigma delta w r in
39:42.980 --> 39:49.319
every layer, because now, the same weight
has been updated in every layer, but the weight
39:49.319 --> 39:56.319
is same weight. So, all those updates according
to back propagation, the infinitesimally small
39:56.839 --> 40:03.839
change in weight that we get, we add over
all the layers and then add to the old weight
40:06.410 --> 40:13.410
and that becomes the new weight. That is the
objective here. For the network with t = 5,
40:24.670 --> 40:27.660
I just showed you t = 2.…
40:27.660 --> 40:34.660
t = 5. Compute the response of the sequence
y 1 to y 5, given the sequence x 0 to x 4
40:35.769 --> 40:42.769
and y 0. We are talking about the network
that we have already talked about here. This
40:45.700 --> 40:52.700
is my total feed-forward network constructed
from the recurrent network by rolling over
40:54.960 --> 41:01.960
time and for this, we are showing the error.
Computed error e 5 is y d 5 minus y 5, delta
41:08.950 --> 41:15.950
w 5 is eta delta 5 x 4, delta g 5 is eta delta
5 y 4. We can easily see here what we are
41:20.420 --> 41:27.420
trying to do. e 5 is y d 5 minus y 5. This
delta w, the change in this weight, is obviously
41:30.190 --> 41:37.190
eta delta 5 x 4, the change in this weight
g is eta delta 5 into input is y 4. That is
41:41.980 --> 41:48.980
what we are writing here, where delta 5 is
the differentiation of the activation function
41:50.130 --> 41:57.130
or derivative of the activation function and
the error at the output.
41:58.440 --> 42:05.440
But when I come to the layer 4, my error at
the layer 4 is e 4 is already there and the
42:10.150 --> 42:17.150
back-propagated error is delta 5 into g, which
we will talk about now here. The error here
42:20.329 --> 42:27.329
is e 4 plus back-propagated error from this
layer, which is delta 5 into this weight g.
42:29.670 --> 42:36.670
Like that, we have to compute. This is for
fourth layer.
42:38.250 --> 42:45.250
This is for the third layer, and finally, this is
the first layer and this is the second layer.
42:55.309 --> 43:02.309
This is the way we apply the generalized delta
rule to this feed-forward network that has
43:12.609 --> 43:18.890
been constructed from the recurrent network
by rolling the recurrent network over time.
43:18.890 --> 43:25.890
Finally, after finding delta w over all the
layers, delta g over all layers, what do we
43:26.799 --> 43:33.799
compute? w new is equal to w old plus delta w
summed over all the layers.
43:34.950 --> 43:41.950
Similarly, g new is g old plus delta g i over
all layers; we add them because there are
43:45.349 --> 43:52.349
five layers of weights, we add and then we
find g new. We talked about a simple network.
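The whole procedure for a network rolled over any number of steps can be sketched as one function: forward propagate to get y 1 up to y T, compute an error at every target, back propagate delta through the feedback weight g, sum the per-layer changes and apply them once to the shared weights. This is an interpretation of the lecture's steps, not its exact code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bptt_epoch(x_seq, y0, y_desired, w, g, eta):
    """One BPTT epoch for the one-neuron network rolled over len(x_seq) steps."""
    # forward pass, keeping every intermediate output y0..yT
    ys = [y0]
    for x in x_seq:
        ys.append(sigmoid(w * x + g * ys[-1]))
    # backward pass from the last layer to the first
    dw = dg = 0.0
    delta_next = 0.0
    for t in range(len(x_seq), 0, -1):
        # local error plus the error back propagated through g
        e = (y_desired[t - 1] - ys[t]) + delta_next * g
        delta = e * ys[t] * (1.0 - ys[t])
        dw += delta * x_seq[t - 1]
        dg += delta * ys[t - 1]
        delta_next = delta
    # the same w and g are shared by all layers, so the summed
    # changes are applied once
    return w + eta * dw, g + eta * dg
```

With a two-step sequence this reduces exactly to the two-layer calculation worked out above.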
43:54.799 --> 44:01.799
Now, let us go to a little complicated network.
Is a complicated network very difficult to
44:04.579 --> 44:11.180
roll over time? No, it is not. Once we understand
the principle, rolling over time is not very
44:11.180 --> 44:11.819
difficult.
44:11.819 --> 44:18.819
For example, you see here, this is a fully
connected network, we have three neurons and
44:21.099 --> 44:28.099
two external inputs X1 and X2 and one output
is Y. Let us say the output of this neuron
44:32.680 --> 44:39.680
is X3, this is X4 and this is X5. So,
Y is X5. If I look at the forward response,
44:45.619 --> 44:51.309
if you look at the normal convention, if you
look at the weight connections, the weights
44:51.309 --> 44:58.309
that are being connected here, this weight
from the input 1…. You see the connection
45:02.250 --> 45:06.579
X1, X2, X3, X4 and X5.
45:06.579 --> 45:13.579
Convention-wise, the weights are…. This
is the node to which the connection is coming
45:14.400 --> 45:21.400
from 1. Obviously, the weight that has been
named is w31. w31 means the weight that connects
45:24.380 --> 45:31.380
the node 3 from node 1. Similarly w33, by
just looking at this weight, you can say this
45:33.690 --> 45:40.690
is a self-feedback weight, that is, this weight
connects node 3 to itself. Similarly, w35,
45:43.579 --> 45:50.579
you can see that this weight is coming from
the node 5 and connecting to 3. Similarly,
45:52.039 --> 45:59.039
w45, if you look at this one, this is coming
from 5 and connecting 4. Similarly, w55 is
46:00.250 --> 46:03.170
a self-feedback weight.
46:03.170 --> 46:10.170
This is the convention of designating the
weights and this is a fully connected network.
46:10.390 --> 46:17.390
What is the forward response? Forward response
is what the output of each neuron is given
46:21.039 --> 46:28.039
X1 and X2 and the present states X3 t, X5 t
and X4 t. So, X3 t plus 1, the response
46:33.349 --> 46:40.349
of the node 3 at the next time instant, is
a sigmoidal activation function f of the summation
46:46.500 --> 46:53.500
of w3i Xi t. Similarly, X4 t plus 1 is w4i
Xi t and X5 t plus 1 is w5i Xi t, because
47:08.500 --> 47:15.500
each node is connected to every node – all
the five nodes (two input nodes and three
47:16.109 --> 47:22.240
neurons, total 5 nodes).
47:22.240 --> 47:29.240
You can easily see that X3 t plus 1 is function
of S3 t plus 1. What is S3 t plus 1? S3 t
47:42.369 --> 47:49.369
plus 1 is summation that is w31 X1 plus w32
X2 plus w33 X3 and so on w35 X5. We are inserting
48:05.089 --> 48:12.089
this particular item here, which is written
as w3i Xi t. This has to be t. Time has to
48:17.259 --> 48:24.190
be there; we cannot do away with time in a
recurrent network like we have done in the
48:24.190 --> 48:26.319
feed-forward network.
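One forward step of this fully connected network might be sketched as follows. Nodes 1 and 2 are the external inputs, nodes 3, 4 and 5 are neurons, and each neuron computes X j of t plus 1 as f of the summation of w ji into X i of t over all five nodes. The 0-based layout of W is an assumption for the code; the lecture names the weights w31 through w55.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fully_recurrent_step(x_in, state, W):
    """One forward step of the fully connected recurrent network.

    x_in  : [X1, X2], the two external inputs at time t
    state : [X3, X4, X5], the neuron outputs at time t
    W     : 3x5 matrix; W[j][i] is an assumed 0-based layout of the
            lecture's weight w_{j+3, i+1} from node i+1 into neuron j+3
    Returns [X3, X4, X5] at time t+1.
    """
    nodes = list(x_in) + list(state)   # [X1, X2, X3, X4, X5] at time t
    return [sigmoid(sum(W[j][i] * nodes[i] for i in range(5)))
            for j in range(3)]
```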
48:26.319 --> 48:33.319
For the same network, once I know this forward
response, I can easily write down here, I
48:37.130 --> 48:44.130
can unfold the same network in time here.
Obviously, this is my X3 0, this is X3 1.
48:47.289 --> 48:54.289
So, X3 1, X3 at time 1 is a function of X1
and this is w31; for X2, this has to be w32;
48:59.619 --> 49:06.619
this is for X3. The best way to look at it
is like this.
49:10.079 --> 49:17.079
I have two input nodes and this is the initial
state. This is X1 0, X2 0 and these are all
49:22.200 --> 49:29.200
initial states. Based on this initial state,
the new states are X3 1, X4 1 and X5 1. How
49:42.210 --> 49:49.210
do I compute X3 1? X1 0 w31 is the weight
connecting X1 0. From X2 0, this is w32. From
49:56.240 --> 50:03.240
X3 0, it is w33, from X4 0, it is w34 and
from X5 0, it is w35. The forward response
50:12.920 --> 50:19.920
X3 1 is a function of
summation of w3i Xi 0. Based on that, we found
out the connection between X3 1, and these
50:37.960 --> 50:44.960
initial states in the previous instant. Similarly,
X4 1 and X5 1. That way, we can roll this
50:52.589 --> 50:57.890
network over time.
50:57.890 --> 51:04.890
Let us take one example today to illustrate
what we have learned. Let us take a simple
51:06.900 --> 51:13.900
linear network, linear dynamics. y t plus
1 is minus 0.5 y t minus y t minus 1 plus
51:15.640 --> 51:22.640
0.5 u t – a simple discrete dynamic equation.
It is a linear dynamic equation, discrete
51:27.920 --> 51:34.920
dynamic. We can represent this particular
equation in terms of this. y t plus 1 is w1
51:40.349 --> 51:47.349
y t plus w2 y t minus 1 plus w3 u t, where
w1, w2 and w3 are unknown weights (we do not
51:51.579 --> 51:58.579
know) and this is the recurrent network to
represent that.
51:58.619 --> 52:05.619
We give x t, output is y t plus 1 and you
see that there are two feedback connections
52:08.549 --> 52:15.549
here, self-feedback with one delay and this
is two delay. Obviously, if I write the forward
52:20.950 --> 52:27.950
response of this network, then this is the
forward response. The forward response is
52:32.869 --> 52:39.869
y t plus 1 is x t into w3 – this
quantity here. Actually, we say x t but
52:45.690 --> 52:52.690
in this case, it is u t (x t is u t). So,
x t into w3 or u t into w3 plus w1 into y
53:00.670 --> 53:07.670
t with one delay, plus w2 into y t minus
1 with two delays. This is my recurrent network
53:11.390 --> 53:18.390
and activation function here f is linear.
So f is linear.
53:21.789 --> 53:28.789
If I take this network, similarly from the
forward response, I can write what is y 2.
53:28.990 --> 53:35.990
y 2 is w1 y 1 plus w2 y 0 plus w3 u 1 and
y3 is w1 y 2 plus w2 y 1 plus w3 u 2 and so
53:39.529 --> 53:46.529
on. This is the corresponding unfolding in
time.
53:46.730 --> 53:52.829
We apply this back propagation through time
using generalized delta rule. We generated
53:52.829 --> 53:59.829
100 data points, where the input u t was selected
randomly between 0 and 1, and given the dynamic
54:04.640 --> 54:11.640
equation that we represented, we used it
for data generation. If you look at this data,
for data generation. If you look at this data,
the circles represent the output, the cross
54:31.029 --> 54:38.029
is u t and the circle is y t plus 1. This
is the input data generated.
54:44.829 --> 54:51.829
Then with back propagation through time, we
trained it. When we trained, let us see how
54:53.799 --> 55:00.799
the training took place. This is RMS error
over number of epochs. You can see how the
55:04.950 --> 55:11.950
RMS error reduced. There were almost 3,000
epochs.
55:13.039 --> 55:20.039
These are the weights. This is our w3, this
is our w1, and this is our w2. w3 has reached
55:29.460 --> 55:36.460
0.5, w1 has reached minus 0.5, and w2 is minus 1.
If I go back and look at the equation that
55:42.200 --> 55:49.200
was given to us, you can see this was my w3,
which is 0.5, w2 also I got minus 1 and w1 is minus 0.5.
55:56.309 --> 56:03.309
The error surface is the plot of cost function
E versus the weight vector W. You can see
56:03.839 --> 56:10.839
how finally we reached the global minimum.
Since this is the linear dynamics, we can
56:13.450 --> 56:20.450
easily reach the global minimum and that is
not a problem.
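The whole linear example can be sketched end to end. For simplicity this uses a normalized delta rule on teacher-forced data rather than full unrolling (an assumed simplification; with noiseless data it recovers the same parameters the lecture finds with BPTT), and the seed and learning rate are illustrative.

```python
import random

random.seed(1)  # illustrative seed

# True system from the lecture: y(t+1) = -0.5*y(t) - 1.0*y(t-1) + 0.5*u(t).
# Generate 100 points with the input u(t) selected randomly between 0 and 1.
u = [random.random() for _ in range(100)]
y = [0.0, 0.0]                               # initial conditions
for t in range(1, 99):
    y.append(-0.5 * y[t] - 1.0 * y[t - 1] + 0.5 * u[t])

# Learn w1, w2, w3 in y(t+1) = w1*y(t) + w2*y(t-1) + w3*u(t) with a
# normalized delta rule; the normalization keeps the steps stable.
w1 = w2 = w3 = 0.0
eta = 0.5
for epoch in range(3000):
    for t in range(1, len(y) - 1):
        pred = w1 * y[t] + w2 * y[t - 1] + w3 * u[t]
        e = y[t + 1] - pred
        norm = 1.0 + y[t] ** 2 + y[t - 1] ** 2 + u[t] ** 2
        w1 += eta * e * y[t] / norm
        w2 += eta * e * y[t - 1] / norm
        w3 += eta * e * u[t] / norm
```

Training should drive w1, w2 and w3 toward minus 0.5, minus 1 and 0.5, matching the values read off the lecture's weight plots.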
56:23.039 --> 56:29.269
To conclude, in this lecture, we learned about
the architecture of a recurrent network, unfolding
56:29.269 --> 56:35.019
a recurrent network in time to make it a feed-forward
network, training of a recurrent network using
56:35.019 --> 56:42.019
BPTT algorithm and we showed an example of
a system identification of a linear discrete
56:45.059 --> 56:52.059
dynamics. Using BPTT, we showed that the actual
parameters are learned. Thank you very much.