WEBVTT
00:19.470 --> 00:26.470
This is lecture 6 in the first module, neural
networks, of a course on intelligent control.
00:29.189 --> 00:36.189
Today, we will take up another architecture
of learning system that is widely used in
00:43.280 --> 00:49.850
control systems. This is called the radial basis
function network.
00:49.850 --> 00:56.850
The network architecture is given here. You
can see that it has three layers. The first
01:03.460 --> 01:10.460
layer is as usual the input layer. The second
layer, which is the hidden layer, is different
01:10.490 --> 01:17.490
– the computational units in the hidden
layer are different in structure compared
01:19.040 --> 01:26.040
to those of the multilayer network that we discussed.
These computational units are known as radial
01:29.000 --> 01:36.000
centers. It is a radial center because if
I look at C1, this is a p by 1 vector – the
01:49.830 --> 01:56.830
same as the x vector, which is p by 1, the
input vector. These centers represent the
02:04.890 --> 02:11.890
clusters in the input space; C1, C2 and Ch
represent the clusters in the input space.
02:17.900 --> 02:24.900
So, Cis represent clusters in the input space.
The output of each center, which is phii,
02:45.140 --> 02:52.140
is a function of the Euclidean distance between
Ci and x.
02:59.300 --> 03:06.300
You can see that the computational unit has
a different function than the multilayer network.
03:15.790 --> 03:22.790
Then what you are seeing is that we compute
this phii. The Cis are known as radial centers;
03:23.230 --> 03:30.230
they represent small clusters in the input
space. Then y, the output, is simply summation
03:34.520 --> 03:41.520
of phii wi, i equal to 1 to h.
This is the structure of a radial basis function
network. There are certain advantages we will
03:57.150 --> 04:01.410
talk about. Let me mention what we talked
about just now.
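The forward pass just described – Gaussian radial centers feeding a linear output layer – can be sketched as follows (a minimal illustrative sketch in NumPy; the function name `rbf_forward` and the values are not from the lecture):

```python
import numpy as np

def rbf_forward(x, centers, weights, sigma=1.0):
    """Forward pass of a radial basis function network.

    x       : (p,)   input vector
    centers : (h, p) radial centers C1..Ch
    weights : (h,)   output-layer weights w1..wh
    """
    # phi_i = exp(-||x - c_i||^2 / (2 sigma^2)) -- Gaussian radial function
    z = np.linalg.norm(centers - x, axis=1)       # Euclidean distances z_i
    phi = np.exp(-z**2 / (2.0 * sigma**2))
    # y = sum_i phi_i w_i -- linear output layer
    return phi @ weights

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([1.0, 0.0])
# Input exactly at the first center: phi_1 = 1, so y = w_1
y = rbf_forward(np.array([0.0, 0.0]), centers, weights)
```

When the input sits exactly on a center, that center's phi is 1 and decays for inputs farther away – the "receptive field" behavior discussed next.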
04:01.410 --> 04:05.519
The radial basis function network consists
of three layers: input layer, hidden layer
04:05.519 --> 04:12.519
and output layer. Unlike in a multilayer network,
where you can have as many hidden layers as you want, here,
04:14.599 --> 04:19.709
you can have only one hidden layer. The computational
units in the hidden layer have a different
04:19.709 --> 04:24.699
function than that of multilayer network.
The hidden units provide a set of functions
04:24.699 --> 04:30.530
that constitute an arbitrary basis of input
patterns. The hidden units are known as radial
04:30.530 --> 04:37.530
centers and are represented by the vectors
C1, C2, … Ch and they have the same vector
04:40.740 --> 04:44.720
dimension as that of the input x.
04:44.720 --> 04:50.020
Transformation from input space to hidden
unit space is nonlinear, whereas the transformation
04:50.020 --> 04:55.090
from the hidden unit space to output space
is linear – that is what we talked about.
04:55.090 --> 05:02.090
Here, phii is the nonlinear transformation
of the input x. Actually, this is a function of
05:05.740 --> 05:12.740
the norm, or the Euclidean distance, between
Ci and x, which we normally represent by the
05:13.690 --> 05:20.690
norm symbol. This is nonlinear – phii is a nonlinear
function of x – whereas the output y is a
05:24.830 --> 05:31.830
linear function of phi. That is the difference.
The dimension of each center for a p input
05:34.030 --> 05:41.030
network is p by 1 – I have already told
you that. This is our network that we discussed.
05:41.620 --> 05:46.610
The radial basis functions in the hidden layer
produce a significant non-zero response only
05:46.610 --> 05:53.610
when the input falls within a small localized
region of the input space. Let me explain
05:56.509 --> 06:03.509
this. We are talking about input space. These
data belong to this input space. The input
06:11.050 --> 06:18.050
space, instead of being represented by each
and every datum, the principle in a radial basis
06:20.199 --> 06:27.199
function network is this: say, for example, the inputs,
the data, are all distributed in a set like
06:29.539 --> 06:36.539
this. If this is the data, where each point represents
a datum in this input space, we can easily
06:43.039 --> 06:50.039
say let us select from this bulk of data two
clusters here, another two clusters here,
06:51.889 --> 06:55.750
another two clusters here. These are small
groups of data points – one cluster, another cluster.
06:55.750 --> 07:02.360
What is the meaning of cluster? This is another
point that represents the data around this.
07:02.360 --> 07:09.360
Similarly, this is another point that represents
the data around this. When the input data
07:12.259 --> 07:19.259
is very near this cluster, then the output
of this cluster or output of this center will
07:21.169 --> 07:28.169
be maximum and the response of this radial
center…. Let me first clarify what I am
07:31.050 --> 07:36.319
trying to tell you. What I am telling you
is that we have a radial center and these
07:36.319 --> 07:43.319
radial centers are placed in the hidden unit.
These radial centers are the clusters in the
07:43.330 --> 07:50.330
input space. Input space means a space where
all possible data in the input are located.
07:55.650 --> 08:02.650
That means we are trying to construct a map
from the input space to output space. In the
08:07.370 --> 08:14.370
input space, all possible varieties of data
are there. The radial centers represent this
08:20.910 --> 08:27.910
data. In an ideal case, the input space may
contain many data – not infinite, but many
08:31.630 --> 08:38.630
data. The idea is to represent this data using finite
data points, and these finite data points are
08:46.350 --> 08:53.350
known as clusters. Each cluster has its own
receptive field in the sense that each cluster
08:57.980 --> 09:04.980
represents a certain data point and for that
data point, that radial center output will
09:06.220 --> 09:12.899
be maximum.
09:12.899 --> 09:19.899
I place these clusters in this hidden layer.
For example, if I say this cluster is this
09:22.309 --> 09:29.309
cluster, then the output of this cluster or
radial center will be maximum for the data
09:35.189 --> 09:42.189
lying in its receptive field. Obviously, the
output of this center for any other data lying
09:42.720 --> 09:49.720
away from this radial center will be minimum.
That is the idea. The radial basis function
09:53.170 --> 09:58.069
in a hidden layer produces a significant non-zero
response only when the input falls within
09:58.069 --> 10:00.800
a small localized region of the input space.
10:00.800 --> 10:07.800
Each hidden unit has its own receptive field
in the input space, which I just told you
10:11.709 --> 10:18.709
– the output will become maximum for the
radial center when the input data belongs
10:22.040 --> 10:28.720
to its receptive field. An input vector xi
that lies in the receptive field for center
10:28.720 --> 10:35.160
cj would maximally activate cj and by proper choice
of weights, the target output is obtained.
10:35.160 --> 10:42.160
The output is given as y equal to sigma phij
wj, where phij is phi of the norm of x minus cj;
10:43.699 --> 10:50.360
actually, this phi represents a function – what
I said earlier. As I have already told, the
10:50.360 --> 10:57.360
output y is a linear function of the outputs
of the radial center, which is phij and phij
11:02.829 --> 11:09.829
is a nonlinear function of x, a nonlinear
function of the distance between x and cj; phi
11:11.329 --> 11:16.970
is some radial basis function.
11:16.970 --> 11:23.970
What are these radial functions? There are
certain popular radial functions. If I define
11:24.309 --> 11:31.309
z as the Euclidean norm between x, the input
data, and the j th center…. We saw that
11:33.679 --> 11:40.679
we have h centers and the distance is Euclidean
distance between the j th center and x. You
11:43.179 --> 11:50.179
can easily say that this is the Euclidean
distance between x and cj. The types of radial
11:52.600 --> 11:59.600
functions that we normally use are Gaussian
radial function and thin plate spline function.
11:59.899 --> 12:06.119
Gaussian is very popular: e to the power minus
z square by 2 sigma square, where z is the
12:06.119 --> 12:13.119
Euclidean distance between the data and the
radial center. Similarly, the thin plate spline
12:14.709 --> 12:21.709
function is z square log z. phi z equal to (z square
plus r square) to the power of 1 by 2 is the
12:23.809 --> 12:30.009
multiquadric function and is the third one. The
inverse multiquadric is just 1 upon (z square
12:30.009 --> 12:37.009
plus r square) to the power 1 by 2, that is,
1 upon the square root of z square plus r square. The
12:38.959 --> 12:45.959
Gaussian function has been very popular while
selecting the radial center.
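The radial functions just listed can be written out directly (an illustrative sketch; `r` denotes the shape parameter of the multiquadric forms and `sigma` the Gaussian width):

```python
import numpy as np

def gaussian(z, sigma=1.0):
    # phi(z) = exp(-z^2 / (2 sigma^2)) -- the most popular choice
    return np.exp(-z**2 / (2.0 * sigma**2))

def thin_plate_spline(z):
    # phi(z) = z^2 log z; taken as 0 at z = 0 (its limit)
    return np.where(z > 0, z**2 * np.log(np.maximum(z, 1e-300)), 0.0)

def multiquadric(z, r=1.0):
    # phi(z) = (z^2 + r^2)^(1/2)
    return np.sqrt(z**2 + r**2)

def inverse_multiquadric(z, r=1.0):
    # phi(z) = 1 / (z^2 + r^2)^(1/2)
    return 1.0 / np.sqrt(z**2 + r**2)
```

Here z is the Euclidean distance between the input x and a center, as defined above.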
12:46.679 --> 12:51.889
What is the normal difference between a radial
basis function network and multilayer network?
12:51.889 --> 12:57.369
The difference is that a radial basis function
network has a single hidden layer and a multiple
12:57.369 --> 13:04.369
layer network has multiple hidden layers.
An MLN has multiple hidden layers and an RBFN has a
13:06.660 --> 13:10.939
single hidden layer. The basic neural model
as well as the function of the hidden layer
13:10.939 --> 13:16.339
is different from that of the output layer.
In the output layer, we have simply summation,
13:16.339 --> 13:23.339
whereas the hidden computation is a function
of the Euclidean distance between input and
13:26.059 --> 13:33.059
the center, whereas the computational units
in multilayer networks are all similar. The
13:34.259 --> 13:41.259
hidden layer is nonlinear, but the output
layer is linear. In a multilayer network, all layers are nonlinear,
13:41.269 --> 13:47.829
but not necessarily – sometimes in a multilayer
network also, we can make the output layer
13:47.829 --> 13:49.869
linear.
13:49.869 --> 13:56.869
In the radial basis function network, the
activation function of the hidden unit is
14:01.019 --> 14:08.019
a function of the
Euclidean distance between the input vector
and the center of that unit. The activation
14:23.679 --> 14:28.279
function is a function of the Euclidean distance
between the input vector and the center of
14:28.279 --> 14:33.910
that unit in a radial basis function network,
whereas in a multilayer network, the activation
14:33.910 --> 14:40.730
function computes the inner product of the
input vector and the weight of that unit.
14:40.730 --> 14:46.429
In a sense, this is all right.
14:46.429 --> 14:50.449
The objective here, whatever is written is
just a comparison. What is happening is that
14:50.449 --> 14:55.100
the activation function is a function of the
Euclidean distance between the input vector
14:55.100 --> 15:01.049
and the center, whereas this is the inner
product – the activation function is a function
15:01.049 --> 15:08.049
of the inner product – strictly, the inner product itself is not the activation function.
The idea should be very clear to all of you
15:11.309 --> 15:18.309
that here, the function phi is a function
of x minus Cj, whereas this is the sigmoidal
15:28.049 --> 15:35.049
function here. If I say it is also phi, then
in this case, it is f of x transpose w, the
15:42.199 --> 15:45.299
connection weight associated with the input.
15:45.299 --> 15:51.739
Such a weight is absent in this case, the
radial basis function network. The radial
15:51.739 --> 15:57.979
basis function network establishes a local
mapping and is hence capable of fast learning.
15:57.979 --> 16:04.979
The multilayer network constructs a global
approximation to the input/output mapping.
16:05.579 --> 16:12.579
The learning in a radial basis function network
consists of two different categories of parameters:
16:12.939 --> 16:19.939
one is the radial centers and the other is
the connection weights in the output layer,
16:24.679 --> 16:31.679
whereas in the multilayer network, the only
parameters are the synaptic weights.
16:32.919 --> 16:39.919
In a radial basis function network, the parameters
are C1, C2 up to Ch – these are all centers
16:49.829 --> 16:56.829
and w1, w2, up to wh are weights. The weights
are in the output layer and the centers are
16:59.459 --> 17:06.459
in the hidden layer.
The training of RBFN requires optimal selection
17:07.549 --> 17:14.549
of the parameter vectors c and w, the weight
vector. We have to select in such a way that
17:24.769 --> 17:28.759
given this set of data x and yd, the cost
function – half the sum over all patterns, i equal
17:28.759 --> 17:35.759
to 1 to M, M being the number of patterns, of (yd minus
y) square (yd is the desired output and y is
17:38.130 --> 17:45.130
the radial basis function output) – must be
minimized. The objective is to select Cis and
18:05.440 --> 18:12.440
wis such that this particular function is
minimized. This is the objective. The following
18:23.720 --> 18:29.980
techniques are used to update the weights
and centers of an RBFN. There are many methods
18:29.980 --> 18:36.980
by which these centers and weights are all
optimized. One is the very simple pseudo-inverse
18:38.679 --> 18:45.679
technique, which is an offline technique;
the gradient descent learning and hybrid learning
18:46.179 --> 18:50.840
are online techniques. Gradient descent is
the same as back propagation that we talked
18:50.840 --> 18:53.370
about in multilayer network.
18:53.370 --> 19:00.370
What is a pseudo-inverse technique? This is
a least square solution. Let me tell you what
19:06.759 --> 19:13.759
we will do here. We have inputs, then we have
hidden units, then we have an output unit.
19:15.899 --> 19:22.899
Given your x p, we compute what is y p. What
we do here in pseudo-inverse technique is
19:33.610 --> 19:40.610
that the radial centers are fixed. How do
you fix the radial center? The radial center
19:47.090 --> 19:54.090
represents the clusters in input data space.
This is my input data space. I have a stack
20:01.120 --> 20:08.120
of input data. What I can do is that I can
randomly sample this data and I assign each
20:11.250 --> 20:18.250
random sample from this data to Cis. What
I do is I assign a random sample from the
20:26.379 --> 20:33.379
input space to Ci. From the input space or
input data, I select a random sample, assign it
20:34.649 --> 20:38.120
to Ci and, like that, I assign all the centers.
20:38.120 --> 20:44.759
The objective is that centers should be selected
so that they are almost uniformly distributed
20:44.759 --> 20:49.490
in the data. That means wherever there are
more data points, more centers should come from
20:49.490 --> 20:56.009
there, and fewer centers should come from where
there are fewer data points. For example, in
20:56.009 --> 21:00.080
this zone, there are more data points and
so, a larger number of points should be picked and
21:00.080 --> 21:07.080
assigned as radial centers.
These data points are very sparsely located
21:08.970 --> 21:14.460
in this zone. Then, very few points should
be picked and assigned as radial centers – that
21:14.460 --> 21:18.649
is the objective. That way, we can fix the
radial center.
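Fixing the radial centers by random sampling from the input data, as just described, might look like this (an illustrative sketch; the function name and toy data are not from the lecture). Sampling uniformly over the stored data automatically places more centers where the data is dense:

```python
import numpy as np

def pick_centers(X, h, rng=None):
    """Assign h radial centers by sampling rows of the input data X.

    Sampling uniformly over the data means dense regions of the input
    space contribute more centers, sparse regions fewer.
    """
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=h, replace=False)  # h distinct samples
    return X[idx]

X = np.random.default_rng(0).normal(size=(150, 2))   # toy input data
C = pick_centers(X, h=30, rng=0)                     # 30 radial centers
```

Each center is literally one of the stored input data points, so it has the same dimension p as the input vector.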
21:18.649 --> 21:23.960
Once the radial center is fixed, then the
outputs are known – these phiis are fixed,
21:23.960 --> 21:28.809
because the phiis are based on the centers;
the centers are fixed, so the phiis are the same – phii
21:28.809 --> 21:35.809
is fixed given a specific data point. Given
a specific data point, what do we have now?
21:42.200 --> 21:49.200
I have to write like this. phi is a vector,
w also is a vector, so I have to write phii
21:55.960 --> 22:02.960
wi. This is right, the sum of phii wi is the same as
phi transpose w. This sum of phii wi is now y p.
22:11.919 --> 22:18.919
For each x p, I can collect this kind of equation.
So how many unknowns do I have? The ws are.…
22:28.649 --> 22:35.649
We can see here the number of parameters now
to be estimated are…. This is actually not
22:37.789 --> 22:44.789
n, this is h – wh. The number of weights is
h because the number of centers is h; the number of
22:47.129 --> 22:54.129
weights is h for a single output radial basis
function network.
23:00.379 --> 23:07.379
We have wis, i equal to 1 to h. For a given
input pattern, we find out this equation phi
23:17.690 --> 23:24.690
transpose w is y p. Like that, how many equations
do we have? p is equal to 1 to M, where M
23:35.549 --> 23:42.549
is the number of data patterns. We have these
many equations. How many unknowns? The unknowns
23:53.519 --> 24:00.519
are h weights – these are unknowns. Normally,
the data patterns are many and h is usually
24:07.450 --> 24:14.450
much less than M. In essence, you have more
equations than unknowns. So,
24:24.029 --> 24:28.999
we can easily find a least square fit. That
is the objective here.
24:28.999 --> 24:34.409
This is a least square problem. Assume a fixed
radial basis function, a Gaussian function.
24:34.409 --> 24:41.409
We assume that all the radial functions in the
hidden units are Gaussian functions. The centers
24:42.350 --> 24:49.309
are chosen randomly as I said. The standard
deviation of the radial function is determined
24:49.309 --> 24:56.190
by an ad hoc choice. The learning steps are
as follows: the width is fixed according to
24:56.190 --> 25:03.190
the spread of the centers. Your phii is an
exponential function (like a Gaussian function):
25:07.710 --> 25:14.710
e to the power of minus h upon d square times
the norm – x minus ci, the Euclidean
25:14.960 --> 25:21.960
distance between x and ci – squared. This is
the distance square. h is the number of centers
25:26.600 --> 25:32.279
and d is the maximum distance between the
chosen centers. What is the meaning of this
25:32.279 --> 25:37.289
maximum distance between the chosen centers?
If we go back….
25:37.289 --> 25:44.289
Let us say that we have two-dimensional data
x1 and x2. I have a data point, one cluster
25:50.570 --> 25:57.570
here, another cluster here and all other clusters
are like this. Obviously, for this cluster
26:03.820 --> 26:10.820
and this cluster, the distance between these
is maximum. Similarly, once you have fixed
26:13.470 --> 26:20.470
the radial centers, find out two centers that
have maximum distance between them.
26:24.059 --> 26:31.059
Once you do that, then that is d, that maximum
distance is d. Obviously, if you look at that,
26:35.360 --> 26:42.360
if I write e to the power of minus x minus
Ci, the Euclidean distance square by sigma
26:42.629 --> 26:49.629
square, then sigma will become d upon the root
of 2 h. Normally, the function is
26:50.759 --> 26:57.759
like this: e to the power of minus z square
by 2 sigma square. This is the normal form
26:58.840 --> 27:05.840
of a Gaussian function and if I represent
this function as this, which is written here,
27:07.110 --> 27:14.110
then obviously sigma is d upon the root of 2 h
and d is the maximum distance between any
27:20.210 --> 27:25.470
two radial centers.
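The width choice just derived, sigma = d upon the root of 2h with d the maximum distance between any two chosen centers, can be computed as follows (an illustrative sketch):

```python
import numpy as np

def rbf_width(centers):
    """sigma = d / sqrt(2 h), where d is the maximum pairwise distance
    between the h chosen radial centers."""
    h = len(centers)
    # all pairwise Euclidean distances between centers
    diffs = centers[:, None, :] - centers[None, :, :]
    d = np.linalg.norm(diffs, axis=-1).max()
    return d / np.sqrt(2.0 * h)

centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # d = sqrt(2), h = 2
sigma = rbf_width(centers)
```

For these two centers, d = root 2 and h = 2, so sigma = root 2 / 2, approximately 0.707.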
27:25.470 --> 27:32.470
Now, we will formulate the problem. The problem
is let us say phi is phi1, phi2 up to phih
27:36.059 --> 27:43.059
is a row vector that is taken here. This is
a row vector phi. These are the outputs of
27:44.419 --> 27:51.419
the radial centers. w is the weight vector.
This is a column vector. phi w is y d, where
27:58.490 --> 28:05.490
y d is the desired output. Now, we are only
talking about a single output RBFN. You can
28:11.590 --> 28:16.740
also talk about multilayer output – it is
the same, the formulation will be the same.
28:16.740 --> 28:23.740
The required weight vector is computed as….
What I am trying to do here is that we will
28:27.850 --> 28:34.279
show the pseudo-inverse technique – how to solve
a radial basis function network.
28:34.279 --> 28:41.279
For a p th pattern, phi p is a vector of phi1
p, phi2 p up to phih p. These are the outputs
28:42.730 --> 28:49.730
of the radial centers. The weight is w1, w2
… wh and the corresponding output is yd
28:52.059 --> 28:59.059
p. Similarly, we have M patterns. If I write
this in terms of matrix, I can write a capital
29:01.549 --> 29:08.549
phi, which is phi1 1, phi2 1 up to phih 1.
These are the radial center outputs for the
29:16.450 --> 29:23.450
first pattern. The second pattern is phi1
2 phi2 2 … phih 2. The M th pattern is phi1
29:28.639 --> 29:35.639
M, phi2 M and so on up to phih M. This is
my phi and correspondingly, each one is multiplied
29:46.100 --> 29:49.639
with W.
29:49.639 --> 29:56.639
Let me not put it here. This I put here. This
is my big phi. I multiply this with w1, w2,
30:05.269 --> 30:12.269
… wh – the same weight vector for every pattern. The pseudo-inverse technique is
30:23.309 --> 30:28.309
offline training. It will allow the patterns
to pass through the network, compute the phi vectors,
30:28.309 --> 30:35.309
store them together with the yd for M patterns and, while doing
all these things, the centers are all fixed.
30:36.759 --> 30:43.759
Hence, what we are doing is that we are keeping
all these row vectors, which is phi1 up to
30:44.980 --> 30:51.980
phih, this phi vector, and we are equating
with the output. Obviously, if we multiply
30:52.039 --> 30:59.039
phi1 1, phi2 1 up to phih 1 with this weight
vector, the corresponding output is yd 1, for the first pattern.
31:00.919 --> 31:07.919
The second one is yd 2 and the M th one is
yd M. What we have got is a matrix notation.
31:13.879 --> 31:20.879
This is phi, this is weight and this is Y.
Now, you can easily see that because the output
31:30.749 --> 31:36.200
has a linear relationship with weight, we
can always represent in terms of matrix format
31:36.200 --> 31:43.200
and you can easily see that Y is an M by 1
vector, W is an h by 1 vector and phi is an
31:52.610 --> 31:57.929
M by h matrix.
31:57.929 --> 32:04.929
The solution will be that capital phi into
W
is Y. We found out this is M by h, h by 1
32:18.649 --> 32:25.649
and this is M by 1. Since this is not a square
matrix, we cannot invert it. The solution
32:28.769 --> 32:35.769
is very easy to find out when phi is a square
matrix and is invertible but when it is not
32:39.919 --> 32:46.919
a square matrix, what we can do is we can
use the pseudo-inverse technique. What is
32:49.779 --> 32:56.779
pseudo-inverse? I multiply both sides by phi transpose: phi transpose phi
W. phi transpose is h cross M and phi is M cross
32:59.429 --> 33:06.429
h, so phi transpose phi is an h cross h matrix
and W is h cross 1. On the right, you multiply phi
transpose Y, which is (h cross M)(M cross 1) and this is again
33:27.869 --> 33:34.869
h cross 1. Everything is consistent. Finally,
W is (phi transpose phi) inverse into phi
33:40.210 --> 33:47.210
transpose Y because this is now a square matrix
but this solution is possible only when….
33:50.600 --> 33:57.600
The solution we got for W is now (phi transpose
phi) inverse phi transpose Y. This is known
34:08.770 --> 34:15.770
as the pseudo-inverse. The solution is possible
only when (phi transpose phi) inverse exists.
Even otherwise, we can use the singular value
34:33.790 --> 34:40.790
decomposition method when this is singular;
singular means its determinant is 0.
35:00.250 --> 35:07.250
There was a little mistake there and I corrected
that. What you found out is that the required
35:10.880 --> 35:17.880
weight vector w is (phi transpose phi) inverse
phi transpose y d and this is known as the
35:18.520 --> 35:23.970
pseudo-inverse of phi. This is possible only
when phi transpose phi is non-singular. If
35:23.970 --> 35:30.030
this is singular, singular value decomposition is
used to solve for w.
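The whole pseudo-inverse step, W = (phi transpose phi) inverse phi transpose Yd, is a one-liner once the phi matrix is assembled (an illustrative sketch; `np.linalg.pinv` computes the pseudo-inverse via singular value decomposition, so it also covers the singular case mentioned above):

```python
import numpy as np

def solve_weights(Phi, yd):
    """Least-squares output weights for fixed centers:
    W = (Phi^T Phi)^-1 Phi^T yd, computed via the pseudo-inverse."""
    return np.linalg.pinv(Phi) @ yd

# M = 3 patterns, h = 2 weights: more equations than unknowns
Phi = np.array([[1.0, 0.1],
                [0.5, 0.5],
                [0.1, 1.0]])
yd = np.array([1.0, 0.0, 1.0])
W = solve_weights(Phi, yd)
```

The resulting W satisfies the normal equations phi transpose phi W = phi transpose Yd, which is exactly the least-squares condition derived above.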
35:30.030 --> 35:37.030
This is an example. We take an EX-NOR function:
0 0 1, 0 1 0, 1 0 0, 1 1 1. This is an EX-NOR
35:43.970 --> 35:50.970
function. You select two centers. Centers
can be selected randomly from the input space.
35:51.660 --> 35:57.710
In input space, we only have four data and
we can select any one of them. Let us select
35:57.710 --> 36:04.710
0 0 and 1 1 as the centers. c1 = 0 0 is one
of the centers and the other center is 1 1.
36:07.880 --> 36:14.880
So, phi1 is e to the power of minus (x minus c1)
square and phi2 is e to the power of minus (x minus
36:18.130 --> 36:23.460
c2) square – Euclidean distances squared – where x is (x1, x2).
36:23.460 --> 36:30.460
The output y is w1 phi1 plus w2 phi2 plus
theta. If you look here, theta is the bias.
36:33.750 --> 36:39.859
This is the architecture we selected. This
is radial basis function network architecture
36:39.859 --> 36:46.859
selected to learn the function EX-NOR. We
have taken two radial centers and one bias.
36:53.180 --> 37:00.180
The bias input is +1, the bias weight is theta,
y is w1 phi1 plus w2 phi2 plus theta and phi1
37:03.730 --> 37:07.099
phi2 are computed according to this.
37:07.099 --> 37:14.099
We have four training patterns. If we compute
for each pattern the phi1, the phi1 for the
37:19.210 --> 37:26.210
first pattern is 1 because you can easily
see here for the first pattern 0 0 and c1
37:29.480 --> 37:35.230
is 0 0. Obviously, the Euclidean
distance is 0, so e to the power of 0 is 1.
37:35.230 --> 37:41.270
That is what we are saying: w1 into phi1,
where phi1 is 1 here, and phi2 is e to the power
37:41.270 --> 37:48.270
of minus 2, because the Euclidean distance
is root 2 and square is 2, so e to the power
37:48.460 --> 37:55.460
minus 2. To let you know, we have only taken
e to the power Euclidean distance square;
37:58.740 --> 38:05.740
that means we have already selected 2 sigma square
equal to 1, that is, sigma
38:06.950 --> 38:13.359
is 1 upon root 2 in the Gaussian function.
38:13.359 --> 38:20.359
Based on that for each pattern…. This is
for 0 0, this is the next pattern 0 1, 1 0
38:22.790 --> 38:29.790
and 1 1. Obviously, for 1 1, phi2 is 1. These
are the four equations we have for four data
38:32.640 --> 38:39.640
patterns. If you form a matrix phi into w
equal to y d, phi W is Y d. If you put this,
38:44.680 --> 38:51.680
then y d is (1, 0, 0, 1),
and this is your phi and w. w
39:00.079 --> 39:07.079
is (w1, w2, theta) – we have two weights
and one bias. If we solve using pseudo-inverse
39:07.869 --> 39:14.869
technique, where w is (phi transpose phi)
inverse phi transpose, this is this value
39:26.180 --> 39:33.180
(2.5, 2.5, and −1.8), which is a column vector.
This is the solution. To conclude, if we have
39:43.930 --> 39:50.930
already some set of input/output data and we
have no requirement for online training, then the
39:54.680 --> 39:58.020
pseudo-inverse technique is good.
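The EX-NOR example can be checked numerically; under the stated choices (c1 = (0, 0), c2 = (1, 1), 2 sigma square = 1, bias input +1), the pseudo-inverse solution comes out as roughly (2.5, 2.5, −1.84), matching the quoted (2.5, 2.5, −1.8). A verification sketch:

```python
import numpy as np

X  = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
yd = np.array([1.0, 0.0, 0.0, 1.0])              # EX-NOR targets
c1, c2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])

# phi_i = exp(-||x - c_i||^2), i.e. 2 sigma^2 = 1
phi1 = np.exp(-np.sum((X - c1)**2, axis=1))
phi2 = np.exp(-np.sum((X - c2)**2, axis=1))
Phi  = np.column_stack([phi1, phi2, np.ones(4)]) # last column: bias input +1

w = np.linalg.pinv(Phi) @ yd                     # (w1, w2, theta)
```

With these centers the four equations are consistent (patterns 2 and 3 coincide), so the least-squares fit is in fact exact and the trained network reproduces EX-NOR at all four patterns.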
39:58.020 --> 40:05.020
If a data set is already available and no
demand on online training, then pseudo-inverse
40:36.410 --> 40:43.410
technique is a good approach to
find the weight vector W. This is our weight
vector. We talked about the pseudo-inverse
40:57.730 --> 41:04.730
technique where we fix the radial center,
but imagine the situation where the data is
41:05.150 --> 41:12.150
coming online and you have to do online training.
For example, in a control system, when we
41:15.859 --> 41:22.859
implement a control system, data is coming online
and we have to train our controller online.
41:24.740 --> 41:31.740
In that case, how should the training be done?
Gradient descent as usual.
41:33.349 --> 41:40.349
Instantaneous gradient descent is a good method
for it. This is again the same methodology
41:42.170 --> 41:49.170
– once you are given the instantaneous cost function
E, cij t plus 1 is cij of t minus eta1 del
41:50.730 --> 41:57.730
E upon del cij and this can be easily updated.
Similarly, for weight also, we can do another
41:58.780 --> 42:05.780
gradient descent, where the instantaneous
cost function is half (y d minus y) square. These are instantaneous
42:06.819 --> 42:09.920
values; these are all scalars, not vectors.
42:09.920 --> 42:16.920
Just to give an example, I will be a little
faster here because we have already discussed
42:18.700 --> 42:25.700
a lot about gradient descent derivation. Our
y is phii wi. phii is e to the power minus
42:26.380 --> 42:33.380
zi square upon 2 sigma square, where zi is
the norm of x minus ci – the Euclidean distance between
42:34.210 --> 42:41.210
x and ci, and sigma is the width of the center.
Differentiating E with respect to wi, you
42:43.130 --> 42:50.130
get this particular thing. Obviously, the
weight
update is wi t plus 1 is wi t plus eta into
43:04.530 --> 43:11.530
error y d minus y into phii. This is the weight
update for weights in a radial basis function
43:14.609 --> 43:20.270
network. It is very simple because it is a
linear network.
43:20.270 --> 43:27.270
The weight update of center…. Imagine each
center has p elements because x, the input,
43:28.599 --> 43:35.599
is a p-dimensional vector. That is why if
you look at the derivation del E upon del
43:36.210 --> 43:43.210
ci, this is this particular thing. First,
you differentiate E with respect to y, y with
43:44.559 --> 43:51.559
respect to phii and phii is a function of
cij. cij is not there in any other radial center;
43:54.700 --> 44:01.700
only the i th radial center contains the element
cij – this is important and hence this expression
44:03.579 --> 44:10.579
is right. We already know del E upon del y
is this, del y upon del phii was wi and to
44:12.510 --> 44:19.289
compute del phii upon del cij, we differentiate phii
with respect to zi, zi with respect to cij.
44:19.289 --> 44:26.289
When you differentiate phii with respect to
zi, we get zi upon sigma square phii negative
44:27.470 --> 44:34.470
and similarly, for zi with respect to cij,
we get this expression. This is negative,
44:36.059 --> 44:41.559
this is also negative, this is also negative,
so overall, the sign of this expression is
44:41.559 --> 44:43.960
negative.
44:43.960 --> 44:50.960
cij t plus 1 is thus our final expression
for Gaussian centers and this is for weights.
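The two instantaneous updates just derived – wi(t+1) = wi(t) + eta (yd − y) phii for the weights, and cij(t+1) = cij(t) + eta1 (yd − y) wi phii (xj − cij) / sigma square for the Gaussian centers – can be collected into one online training step (an illustrative sketch; the function name and learning rates are assumptions, not from the lecture):

```python
import numpy as np

def rbf_step(x, yd, C, w, sigma, eta=0.2, eta1=0.3):
    """One instantaneous gradient-descent step for a Gaussian RBF network.

    C : (h, p) centers, w : (h,) weights. Returns updated (C, w).
    """
    z = np.linalg.norm(C - x, axis=1)            # z_i = ||x - c_i||
    phi = np.exp(-z**2 / (2.0 * sigma**2))       # Gaussian center outputs
    e = yd - phi @ w                             # instantaneous error yd - y
    # weight update: w_i += eta * (yd - y) * phi_i
    dw = eta * e * phi
    # center update: c_ij += eta1 * (yd - y) * w_i * phi_i * (x_j - c_ij) / sigma^2
    dC = eta1 * e * (w * phi)[:, None] * (x - C) / sigma**2
    return C + dC, w + dw
```

Both updates are computed from the same current parameters before either is applied, matching the simultaneous gradient expressions above.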
45:02.920 --> 45:09.920
This is a simple derivation and you can verify
for yourselves. Using this online training
45:12.199 --> 45:18.420
– instantaneous gradient descent, we will
now do a system identification of a surge
45:18.420 --> 45:25.420
tank. We have already discussed what a surge
tank is. A surge tank is there to minimize
45:28.760 --> 45:35.760
the effect of sudden pressure changes in the water
reservoir. Normally, in a hydro power plant,
45:38.240 --> 45:44.700
we have a water reservoir and suddenly, the
level increases. From the water reservoir,
45:44.700 --> 45:51.700
there is a connection to the turbine. If a
sudden increase is there, then the flow increases.
45:55.770 --> 46:02.770
To maintain the same flow, we place a surge
tank here. The flow goes up and the liquid
46:09.930 --> 46:16.930
level in the tank increases. In the surge
tank, the volume has a nonlinear relationship
46:21.579 --> 46:28.579
with the level h. Then, one of the models
of such a surge tank is this. It is a nonlinear
46:31.530 --> 46:38.530
model h t plus 1 is equal to h t plus T into
this quantity, where h t is the liquid level
46:40.640 --> 46:46.329
and u t is the flow inside this surge tank.
46:46.329 --> 46:53.329
What you are seeing here.… Let me do it
like this. This is our water reservoir, this
46:59.200 --> 47:04.349
is the surge tank and we are only concerned
with the model of this surge tank. This model
47:04.349 --> 47:11.349
of the surge tank is that if there is a certain
flow rate into the tank, then how the level
47:13.079 --> 47:19.480
of the tank increases. For this nonlinear behavior,
this is a discrete dynamic model
47:19.480 --> 47:26.480
of the surge tank. Please see this – T into
minus square root of 2 g h t upon square root
47:27.670 --> 47:34.670
of (3 h t plus 1) plus u t upon square root
of 3 h t plus 1, where u t is the water flow
47:35.400 --> 47:42.400
into the surge tank, h t is the liquid level,
g is the acceleration due to gravity, T is
47:43.119 --> 47:49.760
the sampling time and t is the sampling instant.
47:49.760 --> 47:53.839
We have generated data in a similar manner
– we have done it earlier. Sampling time
47:53.839 --> 48:00.839
is 0.01 second and 150 data points have been generated
using this model. Input flow is according to
48:00.890 --> 48:07.890
this particular curve and corresponding h
t liquid level. This is your u t and this
48:08.430 --> 48:15.430
is your corresponding liquid level.
48:16.680 --> 48:22.160
We selected a radial basis function network.
Obviously, it has two inputs and one target.
48:22.160 --> 48:28.339
The two inputs are u t and h t, the target
is h t plus 1, the units in hidden layer are
48:28.339 --> 48:34.730
30, the number of input/output data pairs is 150, the
radial basis function is Gaussian, the width
48:34.730 --> 48:41.730
of the radial function sigma is 0.707, the
center learning rate eta1 is 0.3 and the weight
48:43.289 --> 48:49.150
learning rate is 0.2.
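A minimal sketch of such a Gaussian RBFN trained by gradient descent on both centers and weights, using the hyperparameters quoted above (sigma = 0.707, center rate 0.3, weight rate 0.2). The class layout and initialization ranges are my own assumptions, not the lecturer's code.

```python
import numpy as np

class RBFN:
    """Gaussian radial basis function network with gradient-descent
    training of both the radial centers and the output weights."""
    def __init__(self, n_in, n_hidden, sigma=0.707, eta_c=0.3, eta_w=0.2, seed=0):
        rng = np.random.default_rng(seed)
        self.C = rng.uniform(-1, 1, size=(n_hidden, n_in))  # radial centers
        self.w = rng.uniform(-1, 1, size=n_hidden)          # output weights
        self.sigma, self.eta_c, self.eta_w = sigma, eta_c, eta_w

    def _phi(self, x):
        d2 = ((x - self.C) ** 2).sum(axis=1)            # squared distances to centers
        return np.exp(-d2 / (2.0 * self.sigma ** 2))    # Gaussian basis outputs

    def predict(self, x):
        return self._phi(x) @ self.w                    # y = sum_i phi_i * w_i

    def train_step(self, x, y_target):
        phi = self._phi(x)
        e = y_target - phi @ self.w                     # output error
        self.w += self.eta_w * e * phi                  # weight update
        # center update: chain rule through the Gaussian basis
        grad = e * self.w * phi / self.sigma ** 2
        self.C += self.eta_c * grad[:, None] * (x - self.C)
        return e
```

For the surge tank, x would be the pair (u t, h t) and the target h t plus 1.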
48:49.150 --> 48:53.619
You can easily see the convergence here.
Within 200 epochs or fewer, convergence
48:53.619 --> 49:00.619
is achieved, bringing the root mean square
error below 0.007. When we give a new input,
49:06.869 --> 49:13.869
the input flow follows this particular curve,
then the liquid level here. You can see the
49:17.900 --> 49:24.900
red curve and the green curve. Over the red
curve, there is a green curve and red is desired
49:33.109 --> 49:40.109
and green is actual. You can easily see that
the RBFN model, trained for
49:47.700 --> 49:54.700
200 epochs, could easily map or
learn the dynamics of the surge tank.
49:58.349 --> 50:03.049
We talked about two different learning methods: pseudo-inverse
and gradient descent. Now, we will talk about
50:03.049 --> 50:08.890
a new kind of learning that is normally employed
for radial basis function network, which is
50:08.890 --> 50:15.539
called hybrid learning. What is hybrid learning?
The centers and weights are trained separately. Since
50:15.539 --> 50:21.230
the centers represent the clusters in the input
space, we can use unsupervised learning to
50:21.230 --> 50:28.230
learn them, whereas supervised
learning is used for the weights. Hybrid learning
50:31.140 --> 50:38.140
means unsupervised learning for centers and
supervised learning
for weights. In hybrid learning, the radial
50:56.010 --> 51:01.150
basis functions relocate their centers in
a self-organized manner, that is, unsupervised
51:01.150 --> 51:05.430
learning, while the weights are updated using
supervised learning.
51:05.430 --> 51:10.789
When a pattern is presented to RBFN, either
a new center is grown if the pattern is sufficiently
51:10.789 --> 51:16.740
novel or the parameters in both layers are
updated using gradient descent. The test of
51:16.740 --> 51:20.760
novelty depends on two criteria: Is the Euclidean
distance between the input pattern and the
51:20.760 --> 51:24.920
nearest center greater than a threshold? Is
the mean square error at the output greater
51:24.920 --> 51:31.920
than a desired accuracy? A new center is allocated
when both criteria are satisfied.
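The two novelty criteria can be sketched as a single test; the threshold values below are illustrative, not from the lecture.

```python
import numpy as np

def is_novel(x, error, centers, dist_threshold=0.5, error_threshold=0.05):
    """Return True if a new center should be allocated for pattern x:
    both the distance criterion and the error criterion must hold."""
    if len(centers) == 0:
        return True
    nearest = min(np.linalg.norm(x - c) for c in centers)
    return nearest > dist_threshold and abs(error) > error_threshold
```

When the test fails, the pattern is not novel, and the existing parameters are instead updated by gradient descent.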
51:34.859 --> 51:41.859
Normally, the easiest way to do this center
learning using the classical clustering is
51:47.900 --> 51:54.900
that we fix the number of centers and assign
them random vectors from the input space – this
52:20.160 --> 52:27.160
is the first step, step 1. In step 2,
what we will do is once the
52:32.740 --> 52:39.740
centers are fixed, uniformly sampled from
the input space, then what we do is we present
52:42.849 --> 52:49.789
an input pattern, find the Euclidean distance
between this input pattern and all these centers
52:49.789 --> 52:56.190
which are already fixed, the numbers are also
fixed, and then find the winner. Whichever
52:56.190 --> 53:03.190
is the winner, you update that winning
center. Only the winner is updated; for all
53:04.660 --> 53:11.260
other centers, we do not make any changes –
they remain as they are.
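The two steps above amount to a winner-take-all update, sketched here; alpha is the center learning rate (the value 0.5 is quoted later in the lecture), and the function shape is my own.

```python
import numpy as np

def kmeans_step(centers, x, alpha=0.5):
    """Competitive (online K-means) update: move only the winning center
    toward the input pattern x; all other centers stay unchanged."""
    d = np.linalg.norm(centers - x, axis=1)   # distances to all centers
    k = int(np.argmin(d))                     # index of the winner
    centers[k] += alpha * (x - centers[k])    # winner moves toward x
    return k
```

Calling this once per training pattern, and sweeping the training set repeatedly, gives the clustering pass described above.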
53:11.260 --> 53:18.260
We repeat this process for all the data patterns
from the training set. This is called K-means
53:21.940 --> 53:28.940
clustering. Radial centers are not limited to
K-means clustering; there are other clustering
53:34.630 --> 53:39.900
techniques. We will not focus on that now,
we will just give an idea how this is done
53:39.900 --> 53:46.410
– clustering in an unsupervised manner. What
about the weights? The weights are easy to learn because
53:46.410 --> 53:53.410
the weights and the output have a linear relationship.
We can use gradient
53:54.549 --> 53:58.849
descent, and apart from it,
we can also use the least mean square
53:58.849 --> 54:04.119
algorithm or the recursive least squares algorithm.
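Because the output is linear in the weights, the least mean square update mentioned above is one line per pattern. A sketch, with eta matching the weight learning rate quoted earlier:

```python
import numpy as np

def lms_update(w, phi, y_target, eta=0.2):
    """Least-mean-square update of the linear output weights:
    w <- w + eta * (y_target - w . phi) * phi."""
    e = y_target - w @ phi   # output error for this pattern
    return w + eta * e * phi, e
```

Here phi is the vector of radial-basis outputs for the current pattern, so the centers can stay fixed while the weights converge.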
54:04.119 --> 54:11.119
This is a recursive least square algorithm.
I will not discuss this in this class, but
54:11.609 --> 54:18.609
maybe later. The surge tank can also be modeled
using this hybrid learning, where K-means
54:24.500 --> 54:31.500
clustering has been used with a learning rate
alpha of 0.5. The gradient
54:39.599 --> 54:46.380
descent method has been used for the weight update,
where the eta is… this is eta, this is alpha,
54:46.380 --> 54:53.380
and the training is terminated when the root mean
square error falls below 0.007.
54:55.869 --> 55:02.869
You can easily see that in the beginning,
when we reshuffled this, the centers were
55:05.859 --> 55:12.859
uniformly distributed. These are the centers.
The circles are the centers and this is my
55:14.720 --> 55:21.720
input data. This is my u and this is my h.
What you are seeing is that before unsupervised
55:34.069 --> 55:41.069
learning, the centers are all randomly distributed
– this is the data and as we presented data
55:45.250 --> 55:52.250
to the centers using K-means clustering, you
see that most of these centers were aligned
55:53.710 --> 56:00.710
with the input data and very few are left
unaligned.
56:06.220 --> 56:12.420
This is the final comparison of the results.
What we are seeing is that when we
56:12.420 --> 56:19.420
do the system identification of
the surge tank using the back propagation network,
56:19.450 --> 56:24.130
that is, the multilayer network, the radial basis
function network using gradient descent, and the radial
56:24.130 --> 56:30.710
basis function network with hybrid learning,
then the number of iterations that are required…
56:30.710 --> 56:37.710
you see that back propagation takes a long
time and the radial basis function network
56:39.579 --> 56:46.579
takes less time. The RMS error is the same
because we have fixed the RMS error – the
56:50.769 --> 56:56.450
training is terminated at the same RMS error.
Obviously, the same RMS error for the new
56:56.450 --> 57:03.450
data… means that the generalization of all
three networks is the same.
57:05.970 --> 57:12.970
The conclusion is that today we discussed
the RBFN and talked about three types
57:17.230 --> 57:24.230
of learning. The first one is pseudo-inverse,
which is offline training, the second is normal
57:35.990 --> 57:42.990
gradient descent and the third one is hybrid,
which is a combination of unsupervised plus
57:49.490 --> 57:56.490
supervised. Thank you very much.