WEBVTT
00:19.529 --> 00:26.529
Welcome, everybody. This will be a lecture
on weight update rules for neural networks,
00:31.960 --> 00:38.960
and the various possible schemes. In this course on
intelligent control, we have already discussed
00:42.379 --> 00:49.379
various feedforward neural networks: multilayer
neural networks, radial basis function neural
00:49.549 --> 00:56.549
networks, and the associated algorithms for these
networks. In the last class, we also proposed
00:59.010 --> 01:06.010
a very interesting algorithm for weight update:
an adaptive learning rate derived using
01:09.300 --> 01:16.300
a Lyapunov function. We will now summarize
the various schemes that are employed
01:19.700 --> 01:26.700
to improve the convergence properties, as
well as the global convergence criteria, for
01:31.140 --> 01:34.229
minimizing a cost function given a network.
01:34.229 --> 01:40.750
So, it will be a kind of summary to let
you know the state of the art of various
01:40.750 --> 01:47.750
weight update rules, specifically for feedforward
networks. The topics that will be covered
01:50.409 --> 01:57.409
today are the conventional training algorithm,
gradient descent, which we have already been
01:57.579 --> 02:04.579
talking about and which is popularly known as
the backpropagation algorithm for feedforward
02:05.570 --> 02:12.570
networks; Newton’s method; the quasi-Newton method;
and the conjugate gradient method. These methods
02:14.299 --> 02:21.299
will then be compared with the scheme
that we proposed last time, that is, our
02:25.980 --> 02:32.980
own work, the work of our group at
IIT Kanpur on the adaptive learning rate. And then
02:33.720 --> 02:40.150
we will summarize everything that
we have learnt.
02:40.150 --> 02:46.330
So, as an introduction: feedforward neural networks
are used as function approximators for learning
02:46.330 --> 02:53.330
mappings between the input and output spaces. A
neural network is represented as Y equal to
02:56.140 --> 03:03.140
f(W, X); that is, if I have a neural network
here, some neural network, then my
03:05.640 --> 03:12.640
output vector is Y and my input vector is X. The
neural network is characterized by the weight vector
03:23.180 --> 03:30.180
W; the output of the neural network is then a non-linear
function of the weight vector W and the input X. The weight
03:32.959 --> 03:38.799
vector W is updated in such a way that a
specific cost function is minimized, such that the
03:38.799 --> 03:41.879
network can predict well on the test data.
03:41.879 --> 03:48.879
So, we are given some data; we generate some training
examples for training this network, and
03:50.530 --> 03:57.530
then, with another set of such examples, we
test the predictive ability of this network.
03:59.110 --> 04:06.110
Desirable features of a learning algorithm:
locating the global minimum of the cost function;
04:08.349 --> 04:15.349
fast convergence; and good generalization, that
is, learning from a minimum number of examples. So, I take
04:17.530 --> 04:24.530
a minimum number of examples, and this network is
provided with those few examples; and if, from those
04:27.940 --> 04:34.940
examples, the network can extract
the map of the entire data that is there in the
04:38.000 --> 04:44.430
input-output space for that specific domain,
and the prediction is pretty good, then that
04:44.430 --> 04:51.310
is good generalization. Next, less computational
complexity: this is important, too; we do
04:51.310 --> 04:57.039
not want an algorithm that takes a very
long time to compute. For example, the
04:57.039 --> 05:04.039
backpropagation algorithm, with all its drawbacks,
is very fast; it is simply a first-order approximation
05:07.970 --> 05:14.970
of the Taylor series, which backpropagation
tries to minimize. So, it is very fast,
05:15.520 --> 05:18.660
this algorithm is…
05:18.660 --> 05:25.660
So, traditionally, people have tried
to propose various algorithms for feedforward
05:30.830 --> 05:35.750
neural networks, and even for the feedback
neural networks that we have already talked about;
05:35.750 --> 05:42.750
all these things can be applied both to feedback
and feedforward networks for system identification,
05:42.970 --> 05:47.300
because we are talking about intelligent control.
So, when we talk about neural network, we
05:47.300 --> 05:52.770
are talking in terms of either system identification
or control. So, all of these: gradient descent,
05:52.770 --> 05:58.630
Newton’s method, the quasi-Newton method, the conjugate
gradient method; all these methods, and variants
05:58.630 --> 06:05.630
of these methods, have been applied to neural
networks with varying degrees of success.
06:06.509 --> 06:13.509
So, normally the cost function is a function
of the weights; you see that this is… so, while
06:19.340 --> 06:23.740
training a neural network, the cost function
is usually written as this, where this is
06:23.740 --> 06:30.740
your desired output, this is the actual output
predicted by the neural network, and
06:31.349 --> 06:38.349
p stands for the pattern. So, if I
have N patterns, then this total sum of the squares
06:38.910 --> 06:45.910
of the errors should be minimized. Normally,
this is a quadratic cost function, but we need not necessarily
06:46.599 --> 06:53.599
go for this function; but usually researchers
use this kind of function, this type of
06:54.720 --> 07:01.720
function. And here, look at what y p
is: this y p is the output of the neural
07:04.530 --> 07:11.530
network; y p is a function of the neural network
weights, and x p is the neural network input.
07:12.650 --> 07:19.650
So, this x p is the input and y p is the output. In
that sense, you can easily see that E is actually
07:24.379 --> 07:31.379
a function of W alone, since y p d and x
p are known.
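The cost function being described can be written out explicitly; this is a reconstruction of the slide's formula under the common sum-of-squared-errors convention (the 1/2 factor is an assumption):

```latex
E(W) = \frac{1}{2} \sum_{p=1}^{N} \left\lVert y_p^{d} - y_p \right\rVert^{2},
\qquad y_p = f(W, x_p)
```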
07:43.830 --> 07:50.830
Given a training example, given a training
set, I know what y p d is; in y p also,
07:52.250 --> 07:59.099
this function, I know what x p is; so, all
that I do not know is the weight vector. So,
07:59.099 --> 08:06.099
in that sense, my cost function is simply a
function of the weight vector. Thus, E is a non-linear
08:09.870 --> 08:16.159
function of the weight vector W, and it will be
represented as E(W). You understand why we
08:16.159 --> 08:23.159
are writing E(W): it means E is exclusively
a function of W when we train a neural network.
08:24.599 --> 08:31.599
The objective of learning is to find W star,
such that E W star is optimized or minimized.
08:32.390 --> 08:39.390
Usually, we minimize. So, the recursive form
of the general learning algorithm is: weight vector
08:40.140 --> 08:47.140
new is equal to weight vector old plus delta
W with an initial condition W naught, we start
08:50.100 --> 08:51.880
with some initial condition.
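In symbols, the recursive form of the general learning algorithm just stated is:

```latex
W_{\text{new}} = W_{\text{old}} + \Delta W, \qquad W(0) = W_0,
\qquad W^{*} = \arg\min_{W} E(W)
```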
08:51.880 --> 08:58.880
Now, let us talk about gradient descent, I
mean what we are trying to do in this lecture
08:59.079 --> 09:03.600
is that I would like to give a comprehensive
picture of all these learning algorithms that
09:03.600 --> 09:10.600
we have been talking about until now. So,
let us talk about the gradient descent algorithm:
09:10.689 --> 09:15.170
what actually is gradient descent, and what
does it do?
09:15.170 --> 09:19.689
We understood in our previous discussion
that it certainly does not take us to, or does
09:19.689 --> 09:26.689
not guarantee, global convergence;
theoretically, gradient descent cannot
09:30.920 --> 09:37.920
take us to the global minimum. That is,
if we have a cost function like this, E,
09:43.980 --> 09:50.980
and this is W. So, obviously if I start from
here I may reach here, if I start from here
09:51.640 --> 09:58.290
then I may reach here. So, it all depends
on the initial condition, where I am starting
09:58.290 --> 10:04.290
and since I do not know where the global
minimum is, how can I fix an initial condition
10:04.290 --> 10:09.569
that will be very near to the global minimum?
This is very difficult, and
10:09.569 --> 10:16.569
theoretically, gradient descent cannot ensure
global convergence; it only ensures local convergence.
10:17.410 --> 10:24.410
So, in that sense, what does gradient descent do?
Let us think about the function that we talked
10:24.420 --> 10:31.420
about, the error function E(W) that we are trying
to minimize. If I expand this error function
10:34.679 --> 10:41.679
around some initial point W naught, doing
a Taylor series expansion, then you see
10:43.850 --> 10:50.850
that it is E(W naught) plus del E by del W times delta
W, the first-order term; then the second-order term, del
10:53.260 --> 11:00.260
square E by del W square times delta W squared - this is
actually the whole quantity squared - plus the higher-order terms.
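Written out, the Taylor series expansion being described is as follows (the second-order term carries the usual 1/2 factor, which the audio garbles):

```latex
E(W_0 + \Delta W) \;=\; E(W_0)
\;+\; \frac{\partial E}{\partial W}\bigg|_{W_0} \Delta W
\;+\; \frac{1}{2}\, \Delta W^{T}
      \frac{\partial^{2} E}{\partial W\, \partial W^{T}}\bigg|_{W_0} \Delta W
\;+\; \cdots
```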
11:04.689 --> 11:11.689
So, this is the series that we are talking
about; and if we neglect, in this expansion,
11:16.429 --> 11:22.319
the second-order and higher-order terms, that is,
if I simply consider the first-order Taylor
11:22.319 --> 11:29.319
series expansion of the cost function, then
this is my expression. So, this approximation,
11:30.939 --> 11:37.939
this first-order version, is the first-order approximation
of E(W), its first-order Taylor series expansion;
11:54.390 --> 12:01.390
TS here stands for Taylor series. So, this is what
we have written here. This
12:13.500 --> 12:20.500
Taylor series expansion of the cost function,
the first order Taylor series looks like this
12:21.660 --> 12:25.240
which is given in this block.
12:25.240 --> 12:32.240
Now, if I select delta W as minus eta del E
upon del W transpose: E is a scalar, and when I
12:42.780 --> 12:48.949
differentiate a scalar with respect to a vector,
I get a row vector. So, the transpose makes
12:48.949 --> 12:55.949
it a column vector, and delta W is a column
vector. So, we get, if I replace this delta
12:59.809 --> 13:06.240
W, this particular expression, in the previous
expression, the one that we
13:06.240 --> 13:13.240
wrote earlier: E hat is E(W naught) plus del
E upon del W times delta W.
13:23.819 --> 13:30.819
So, in this, replace this delta W by
minus eta del E upon del W transpose; if you
13:32.510 --> 13:39.510
do that, then E hat minus E(W naught) will become this
quantity, because if you take this one here,
13:42.689 --> 13:49.689
you get minus eta del E upon del W norm
squared; and since this is a known quantity,
13:51.079 --> 13:58.079
which is always a positive quantity, hence
this particular term is always negative. What
14:02.770 --> 14:09.770
does it imply? It implies that if I start from
W naught, then at the next updated W, the cost would always be
14:11.319 --> 14:14.850
less than E W naught.
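The argument just made can be summarized in symbols: substituting the gradient descent choice of delta W into the first-order approximation gives

```latex
\Delta W = -\eta \left( \frac{\partial E}{\partial W} \right)^{T}
\quad\Longrightarrow\quad
\hat{E}(W) - E(W_0)
= \frac{\partial E}{\partial W}\, \Delta W
= -\eta \left\lVert \frac{\partial E}{\partial W} \right\rVert^{2} \;\le\; 0
```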
14:14.850 --> 14:21.850
So, again, this W then becomes the initial
point and you go to the next point;
14:22.750 --> 14:29.750
so it will always decrease; it will
always decrease, but
14:32.040 --> 14:39.040
only in one sense: the decrease is ensured
only when E(W) is looked at
14:40.660 --> 14:47.660
as the first-order Taylor series approximation. But
this is not true in general; I cannot say E(W) is less
14:51.630 --> 14:58.610
than or equal to E(W naught); that holds for E hat (W),
and E hat (W) is the first-order Taylor series
14:58.610 --> 15:01.559
approximation of the cost function.
15:01.559 --> 15:07.319
So, the gradient descent algorithm ensures
that the first order approximation E hat W
15:07.319 --> 15:13.640
reaches its global minimum, but it does not
ensure global convergence for the original
15:13.640 --> 15:20.640
cost function E(W). So, in that sense, what
gradient descent does is try to optimize
15:22.679 --> 15:29.679
the first-order Taylor series approximation of
15:33.179 --> 15:40.179
the original cost function. So, in that sense,
we only achieve local convergence.
15:42.520 --> 15:49.520
So, the gradient descent algorithm, as usual,
is given by this. The learning rate eta determines
15:50.329 --> 15:56.370
the speed of convergence. Convergence depends
on a proper choice of the initial condition, as
15:56.370 --> 16:02.280
I have already told you.
So, now let us take a simple example of a
16:02.280 --> 16:09.280
gradient descent: E W is 0.5 W square minus
8 sin W plus 7. So, this is a non-linear cost
16:10.319 --> 16:15.890
function; it is a multimodal function with
two local minima and one global minimum.
16:15.890 --> 16:22.699
The gradient descent update law for parameter
W is obtained as this. So, delta W is minus
16:22.699 --> 16:29.699
eta del E by del W. So, if I differentiate
this, I get minus eta into - 2 into 0.5 is 1 - so W
16:32.039 --> 16:39.039
minus 8 cos W, and that is it. So, you see that
this particular cost function, this is actually
16:42.870 --> 16:49.870
E(W). So, this E(W) has two local minima
and one global minimum. So, interestingly,
16:57.049 --> 17:04.049
if you look at the result: if I use the
value eta equal to 0.01, which you are seeing
17:05.689 --> 17:12.689
marked as circle one. So, if I
start from this point, W naught equal to minus 8,
17:12.929 --> 17:19.929
if I start from there, then what I see is
that it comes down slowly and converges
17:19.949 --> 17:25.660
at this point and does not move from there.
So, it actually converges to a local minimum.
17:25.660 --> 17:32.660
So, if I start from here, I reach here; but
if I change eta from 0.01 to eta
17:32.720 --> 17:39.720
equal to 0.4, then my step size,
the learning rate, has increased
17:40.660 --> 17:46.200
a little bit. You can easily see that,
marked as square one: from this point it jumps
17:46.200 --> 17:51.750
to this point and from this point it goes
to this point and again from here, until it
17:51.750 --> 17:55.540
converges to global minimum in this case.
17:55.540 --> 18:01.750
So, by changing this eta, I am able, in
this case, to reach the global minimum.
18:01.750 --> 18:08.340
So, when eta is 0.01 and W naught is minus 8,
the final weight, W approximately minus 4, corresponds
18:08.340 --> 18:14.660
to the local minimum, this one at about minus 4. When eta
equals 0.4 and W naught is minus 8, the
18:14.660 --> 18:21.080
local minimum can be avoided, and the final
weight ultimately settles down at the global minimum
18:21.080 --> 18:27.790
at this point. So, I reach the global minimum;
but this is a simple function, and because this
18:27.790 --> 18:34.790
is a simple function we are able to show that
by changing this eta intelligently.
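The lecture's example can be reproduced in a few lines of code. This is a minimal sketch, not the lecture's own program; the local-minimum location, about W = -4.16, is computed here rather than read off the slides:

```python
import math

def E(w):
    """Cost function from the example: multimodal, one global and two local minima."""
    return 0.5 * w**2 - 8 * math.sin(w) + 7

def dE(w):
    """Gradient dE/dw = w - 8 cos(w)."""
    return w - 8 * math.cos(w)

def gradient_descent(w0, eta, steps=1000):
    """Plain gradient descent: w <- w - eta * dE(w)."""
    w = w0
    for _ in range(steps):
        w -= eta * dE(w)
    return w

# Small step size from w0 = -8: the iterate settles in the local minimum
# near w = -4.16, not the global minimum near w = 1.40.
w_small = gradient_descent(-8.0, eta=0.01)
print(w_small)
```

With a larger step size (eta = 0.4 in the lecture) the iterate can jump out of this local basin, though a step that large can also keep the iterate bouncing around rather than settling, which is exactly the observation about larger step sizes made in this lecture.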
18:35.110 --> 18:42.110
We can reach it. But if the function is complex,
we have no idea, no comprehensive
18:45.630 --> 18:52.630
method to reach the global minimum using gradient
descent, although heuristically we can get
18:54.320 --> 19:01.320
close to the global minimum. So, the observations:
for a small step size, gradient descent gets
19:03.880 --> 19:09.840
stuck at a local minimum; for a larger step size,
it may come out of the local minimum and get into
19:09.840 --> 19:15.170
the global minimum; but the algorithm may not converge
for a larger step size either. There is no
19:15.170 --> 19:21.740
comprehensive method to select the initial weight
W naught and the learning rate eta for a given
19:21.740 --> 19:28.740
problem such that global convergence
is achieved. So, that is about gradient descent.
19:35.570 --> 19:42.570
Now, because gradient descent is very slow
19:49.700 --> 19:56.700
in its convergence and does not guarantee global
convergence, people thought about Newton’s
19:59.430 --> 20:06.430
algorithm. In gradient descent, the
weight update law is such that it tries to
20:07.560 --> 20:14.550
find the global minimum of the first order
Taylor series approximation of the cost function.
20:14.550 --> 20:21.550
Now, in Newton’s algorithm, we take the same
cost function E(W) and again expand it as a
20:21.920 --> 20:28.920
Taylor series, and we take the second-order
approximation here. So, if we take the second-order
20:29.780 --> 20:36.780
approximation: here, let us define
g as del E upon del W, the gradient vector,
20:42.950 --> 20:49.950
and there is a Hessian matrix, defined
as H, which is del squared E upon del W del
20:50.390 --> 20:56.840
W transpose. Neglecting the higher-order terms,
we have the following second-order approximation
20:56.840 --> 21:03.840
of the cost function. So, this is the second-order
approximation; here, E hat (W) is the second
21:08.520 --> 21:15.520
order Taylor series approximation of the actual
cost function E(W), and we write it in this particular
21:27.360 --> 21:34.360
form. So, all the higher-order terms
from this point onwards have been neglected.
21:34.360 --> 21:40.280
So, this is a quadratic function of delta
W. The necessary condition for minimization
21:40.280 --> 21:47.280
of E hat (W), which is the second-order approximation
of the actual cost function, is given by del E hat
21:48.430 --> 21:55.430
upon del W equal to 0, which gives the following
update law for the weight vector: delta
21:57.240 --> 22:04.240
W equal to minus H inverse g. That is, if I go here and
differentiate this, del E hat (W) by del W, this
22:08.630 --> 22:15.630
constant term goes away. So, differentiating
this, I get delta W equal to minus H inverse g; this is
22:26.330 --> 22:33.330
my weight update law
using Newton’s algorithm.
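The derivation just described, in symbols: with the gradient g (as a column vector) and the Hessian H as defined above, the second-order approximation and its minimizer are

```latex
\hat{E}(W_0 + \Delta W) = E(W_0) + g^{T} \Delta W
+ \frac{1}{2}\, \Delta W^{T} H\, \Delta W,
\qquad
\frac{\partial \hat{E}}{\partial (\Delta W)} = g + H \Delta W = 0
\;\Longrightarrow\;
\Delta W = -H^{-1} g
```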
22:43.620 --> 22:50.620
But unfortunately, of course now once I take
this, you can easily see that E hat minus
22:53.130 --> 23:00.130
E(W naught) is not always a negative quantity;
it is negative definite
23:01.510 --> 23:08.510
only when H is positive definite. So, this
is true provided H is positive definite; what does it mean that
23:11.490 --> 23:18.490
H is positive definite? I hope all of you
understand positive definiteness: if I
23:22.010 --> 23:29.010
take any vector x, then x transpose H x, if H is
a positive definite matrix, this quantity
23:36.880 --> 23:43.880
is always a positive quantity for any x;
take any vector x, and this
23:44.660 --> 23:51.660
will be a positive quantity; that is what it means for H
to be positive definite. In that sense, this
23:54.660 --> 24:01.660
algorithm is not a foolproof algorithm, because
it does not even guarantee locally stable behaviour
24:03.660 --> 24:10.660
unless H is positive definite - that is the
biggest drawback of Newton’s algorithm. But
24:12.130 --> 24:18.050
still, researchers have used it because it is
very fast. But you see that this
24:18.050 --> 24:24.860
weight update algorithm requires H inverse,
the inverse of a Hessian matrix. Imagine: a normal
24:24.860 --> 24:31.860
network has at least 100 weights. So, the
dimension of W is at least 100.
24:32.100 --> 24:38.620
So, H is at least 100 by 100; the dimension
of H would be at least 100 by 100. We have
24:38.620 --> 24:45.620
said 100 by 100 because it is a double
derivative. So, its dimension is huge, and the inversion
24:47.110 --> 24:53.630
is computationally very expensive. So, in
that sense, although we can
24:53.630 --> 24:59.760
use this algorithm to solve a simple cost
function optimization, in terms of neural
24:59.760 --> 25:06.760
networks, I would say it is an infeasible
algorithm; we cannot implement it because the
25:06.940 --> 25:13.110
Hessian inverse cannot be computed cheaply; it will take a lot
of time.
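The positive-definiteness condition on H can be checked numerically; a minimal sketch, where the matrix below is made up purely for illustration and is not from any real network:

```python
import numpy as np

# A hypothetical symmetric "Hessian" (illustration only):
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, -1.0]])

# H is positive definite iff x^T H x > 0 for every nonzero x,
# equivalently iff all eigenvalues of the symmetric matrix are positive.
eigenvalues = np.linalg.eigvalsh(H)
is_positive_definite = bool(np.all(eigenvalues > 0))

# Here the last coordinate direction gives x^T H x = -1 < 0,
# so this H is indefinite, and the Newton step along x points uphill.
x = np.array([0.0, 0.0, 1.0])
print(x @ H @ x, is_positive_definite)
```

For an n-weight network, H is n by n, and both the eigendecomposition and the inverse cost on the order of n cubed operations, which is exactly the computational objection raised above.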
25:13.110 --> 25:20.110
So, the Newton algorithm: starting with an
initial condition W naught, iterate the following
25:21.750 --> 25:28.750
equation until the stopping criterion is met:
W t plus 1 equals W t minus H inverse g t.
25:29.650 --> 25:36.450
This is the Newton algorithm, and the condition
is that the Hessian matrix H t has to be a positive
25:36.450 --> 25:42.680
definite matrix for all t; that has to hold, and
you cannot guarantee it for H all the time.
25:42.680 --> 25:47.230
The steepest descent method operates on the basis
of a linear approximation of the cost function,
25:47.230 --> 25:53.570
while Newton’s method uses quadratic approximation
of the error surface around the current point
25:53.570 --> 26:00.550
W t. It is computationally intensive, as it
requires the matrix inverse H inverse; for a non-quadratic
26:00.550 --> 26:07.470
cost function E(W), convergence is not guaranteed.
This makes it unsuitable for training feed
26:07.470 --> 26:09.980
forward networks.
26:09.980 --> 26:16.980
Now, we will take the same example that we
considered earlier. E(W) is a scalar; we see
26:19.150 --> 26:26.040
that W is a scalar; no, W is not a vector
here, it is simply a scalar. This is a very simple
26:26.040 --> 26:33.040
function, 0.5 W squared minus 8 sin W plus 7;
and because W is a scalar, naturally
26:36.230 --> 26:43.230
del E upon del W is a scalar, which is g, and
H is del squared E by del W squared.
26:46.720 --> 26:53.720
So, that is also a scalar,
and the weight update law is delta W equal to minus eta
H inverse - H here is a scalar, so its inverse
27:05.100 --> 27:12.100
times g; that is our algorithm, where g(W)
is E dash (W) - you see, this is not f, this
27:13.350 --> 27:20.350
is E: E dash (W), and this is E double dash (W); that is,
this dash represents del E
27:26.520 --> 27:33.520
by del W, and this, the double dash, is del squared E by del W
squared.
27:36.850 --> 27:43.850
So, this represents this, and this represents
this; and using this, if you update, you see
27:49.090 --> 27:56.090
that for some initial conditions, when W naught
is minus 6 and eta is 0.01, the weight diverges;
27:58.800 --> 28:05.800
that is because H is not positive definite.
So, in this case, H is not positive definite;
28:12.809 --> 28:18.960
H is a scalar here, and it is obviously not
greater than 0; it is a negative quantity.
28:18.960 --> 28:25.840
And when W naught is minus 7 and eta is 0.001,
the final weight converges to the local minimum
28:25.840 --> 28:32.840
here. And when W naught equals 0 - so if you
start from here you reach here, but if you
28:34.600 --> 28:40.840
start from here you reach here - the final
weight converges to the global minimum.
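The damped Newton iteration on the same scalar example can be sketched as follows; this is a sketch, with eta chosen here as 0.1 for speed (the lecture used smaller values, and exact trajectories depend on that choice):

```python
import math

def dE(w):
    """Gradient: E'(w) = w - 8 cos(w)."""
    return w - 8 * math.cos(w)

def d2E(w):
    """The 'Hessian' (a scalar here): E''(w) = 1 + 8 sin(w)."""
    return 1 + 8 * math.sin(w)

def damped_newton(w0, eta, steps=500):
    """Damped Newton update: w <- w - eta * E'(w) / E''(w)."""
    w = w0
    for _ in range(steps):
        w -= eta * dE(w) / d2E(w)
    return w

# From w0 = 0 the curvature stays positive along the whole path, and the
# iterate reaches the global minimum near w = 1.40.
print(damped_newton(0.0, eta=0.1))

# But the curvature is not always positive: at w = -2, E''(w) < 0, so
# there the Newton step points uphill, toward a stationary point that
# is not a minimum - the positive-definiteness problem discussed above.
print(d2E(-2.0))
```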
28:40.840 --> 28:47.570
But then, we really did not achieve much
in comparison with gradient descent,
28:47.570 --> 28:54.570
except, probably, speed: with gradient descent
converging to the local minimum, whatever
28:56.960 --> 29:02.500
the number of steps it will take to go to
the local minimum, probably using this Newton
29:02.500 --> 29:09.500
method, you can improve on that speed; but in
no way is the desired goal of learning achieved
29:10.559 --> 29:13.770
by Newton’s method.
29:13.770 --> 29:20.770
So, the observations are as follows: convergence
to the global minimum depends on the selection
29:23.350 --> 29:29.820
of a proper initial condition W naught. The algorithm
may diverge for an unsuitable initial condition,
29:29.820 --> 29:36.820
because H(W) is not a positive quantity -
in fact, in general - well, H
29:39.640 --> 29:46.640
is a scalar for our example, but in general,
H, the Hessian matrix, is not positive definite,
29:49.130 --> 29:56.130
or is not positive
definite always.