WEBVTT
Kind: captions
Language: en
00:00:08.080 --> 00:00:23.640
So, we have come to a situation where the
problem is that, from the population we draw
00:00:23.640 --> 00:00:38.640
a certain number of samples, say n samples,
and a part of them will be found to have
00:00:38.640 --> 00:00:43.470
the value of 1, and a part of them will have
the value of 0.
00:00:43.470 --> 00:00:48.620
And we are trying to find out what proportion
has the value of 1.
00:00:48.620 --> 00:00:59.040
So, we have drawn n samples.
Suppose the number of 1’s in those n samples
00:00:59.040 --> 00:01:16.610
is called X.
So, X is the number of 1’s in the sample of
00:01:16.610 --> 00:01:31.060
n individuals.
So, X is some part of n, and we will now face
00:01:31.060 --> 00:01:38.640
the problem of finding what is the mean of
X, the standard deviation of X, and so on
00:01:38.640 --> 00:01:45.820
and so forth.
If I draw that n number of samples again and
00:01:45.820 --> 00:01:51.690
again from that population, I will not get
the same value of X every time.
00:01:51.690 --> 00:01:59.320
I will get different values, and those
values of X will also have a distribution.
00:01:59.320 --> 00:02:05.950
And we have just concluded that, if the number
of samples becomes large, that distribution
00:02:05.950 --> 00:02:13.450
will become more or less a normal distribution.
And now we are facing the problem of
00:02:13.450 --> 00:02:18.341
finding what is the mean of X, and what is
the standard deviation of X.
00:02:18.341 --> 00:02:33.200
So, our problem is
the mean of X and the standard deviation
00:02:33.200 --> 00:02:39.420
of X.
Now, X is the number of 1’s
00:02:39.420 --> 00:02:47.100
in the sample of n individuals.
Therefore, in finding out what X is, what
00:02:47.100 --> 00:02:52.090
I am doing is picking up each individual
from the sample and asking whether
00:02:52.090 --> 00:02:54.230
it is 1 or 0?
Again I am picking the second individual,
00:02:54.230 --> 00:02:58.360
I am asking: 1 or 0?
This is the same as what we were doing earlier.
00:02:58.360 --> 00:03:01.209
That means, we are actually making a measurement
of the Y’s.
00:03:01.209 --> 00:03:05.910
Each measurement is Y, which can have two
values 1 or 0.
00:03:05.910 --> 00:03:12.370
We have seen that earlier.
So, the measurement of X is actually n
00:03:12.370 --> 00:03:20.750
measurements of Y.
Which is nothing but Y plus Y plus Y, and that
00:03:20.750 --> 00:03:31.750
goes on n times.
So, if that be so, we have already found out
00:03:31.750 --> 00:03:37.430
what is the mean of Y.
Then the mean of X is nothing but the mean
00:03:37.430 --> 00:03:46.660
of Y, n times.
So, we can argue that the mean of
00:03:46.660 --> 00:03:56.170
X can be simply expressed as the mean of Y
plus the mean of Y plus the mean of Y ...
00:03:56.170 --> 00:04:08.290
n times, and we know what the mean of Y is.
We know that the mean of Y is p.
00:04:08.290 --> 00:04:18.009
Therefore, n times p is np.
So, that should be the mean of X.
00:04:18.009 --> 00:04:24.949
Now, what will be this variance?
I would argue that we can work out
00:04:24.949 --> 00:04:33.199
the variance exactly in the same way.
So, the variance of X: again,
00:04:33.199 --> 00:04:49.669
each measurement of X
is nothing but n independent measurements of Y, and we
00:04:49.669 --> 00:04:56.379
know the variance in the measurement of Y,
and therefore, this is nothing but n times
00:04:56.379 --> 00:05:04.240
the variance of Y.
Okay?
00:05:04.240 --> 00:05:13.860
And the variance of Y we have already learned:
it is p(1 − p), so n times the variance of Y is np(1 − p).
00:05:13.860 --> 00:05:22.659
So, that should be the variance of X.
So, we have a distribution whose variance
00:05:22.659 --> 00:05:25.389
is known.
Therefore, the standard
00:05:25.389 --> 00:05:36.379
deviation would be the square root of
np(1 − p).
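To see the two results for X side by side, here is a minimal sketch that compares the formulas np and the square root of np(1 − p) with a simulation. The values n = 50 and p = 0.6, the number of trials, and the function names are assumptions chosen only for illustration.

```python
import math
import random

def binomial_mean_sd(n, p):
    """Mean and standard deviation of X, the number of 1's in n draws."""
    return n * p, math.sqrt(n * p * (1 - p))

def simulate_counts(n, p, trials, seed=0):
    """Repeat the sampling `trials` times; each time count the 1's among n draws."""
    rng = random.Random(seed)
    return [sum(1 for _ in range(n) if rng.random() < p) for _ in range(trials)]

# Illustrative values (assumptions, not fixed by the lecture at this point).
mean_x, sd_x = binomial_mean_sd(50, 0.6)

counts = simulate_counts(50, 0.6, trials=2000)
sim_mean = sum(counts) / len(counts)
sim_sd = math.sqrt(sum((c - sim_mean) ** 2 for c in counts) / len(counts))

print("formula:   ", mean_x, sd_x)
print("simulation:", sim_mean, sim_sd)
```

The simulated mean and standard deviation should land close to the formula values, though not exactly on them, since each run is itself a sample.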
00:05:36.379 --> 00:05:42.699
Okay?
Now, what are we actually trying to find out?
00:05:42.699 --> 00:05:50.270
We are actually trying to find out the proportion,
the measured proportion.
00:05:50.270 --> 00:06:04.129
The measured proportion
00:06:04.129 --> 00:06:11.610
is the number of 1’s that
we have found, X,
00:06:11.610 --> 00:06:15.240
divided by the total number of samples.
That is the measured proportion.
00:06:15.240 --> 00:06:19.830
It is measured; therefore,
it is not the proportion out there.
00:06:19.830 --> 00:06:24.860
Let us write it with a hat,
so that we know that we are talking about
00:06:24.860 --> 00:06:29.939
the measured proportion: the proportion of 1’s that we
have measured using n
00:06:29.939 --> 00:06:38.089
samples is p̂, and that
will be the measured proportion.
00:06:38.089 --> 00:06:49.879
And if we do a large number of trials of taking
n samples each time, we will get a distribution
00:06:49.879 --> 00:06:56.360
of the measured proportions.
For one sampling, that is, when we have taken
00:06:56.360 --> 00:07:01.330
n samples, we get one measured proportion.
If we do that again, taking another n samples,
00:07:01.330 --> 00:07:05.099
we will get another value of this.
A third time you will get another value of
00:07:05.099 --> 00:07:10.679
this.
Ultimately, p̂ will have a distribution.
00:07:10.679 --> 00:07:15.110
What will be the mean of that distribution?
What will be the mean of the distribution
00:07:15.110 --> 00:07:26.240
of the p̂’s?
Let us call it the mean of p̂.
00:07:26.240 --> 00:07:36.020
That will be the mean of X divided
by n.
00:07:36.020 --> 00:07:41.159
Since p̂ is X by n, it will
be this, and the mean of X is something we know:
00:07:41.159 --> 00:07:45.009
n times p.
So, n times p by n.
00:07:45.009 --> 00:07:52.870
Therefore, it will be p.
So, if I repeatedly make measurements,
00:07:52.870 --> 00:07:58.330
each time taking n samples, I will get a distribution,
and that distribution will have a mean which
00:07:58.330 --> 00:08:05.550
is the proportion ‘out
there’ in the population.
00:08:05.550 --> 00:08:14.059
So, that is a good thing:
we can obtain the actual proportion
00:08:14.059 --> 00:08:20.129
by making repeated measurements, every time
taking n samples.
00:08:20.129 --> 00:08:30.389
And what will be the standard deviation?
The standard deviation, σ of p̂, will
00:08:30.389 --> 00:08:41.870
be σₓ divided
by n, since there are n samples.
00:08:41.870 --> 00:08:55.220
Now, we know that σₓ was the square root of
np(1 − p), and here we divide by n.
00:08:55.220 --> 00:09:07.080
Therefore, this comes to be the square root of
p(1 − p)/n.
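The step from X to p̂ is just a division by n, and a short sketch can confirm that dividing σₓ by n gives the same number as the square root of p(1 − p)/n. The values n = 100 and p = 0.6 and the function name are assumptions for illustration only.

```python
import math

def proportion_mean_sd(n, p):
    """Mean and standard deviation of p-hat = X / n, obtained by scaling X."""
    mean_x = n * p
    sd_x = math.sqrt(n * p * (1 - p))
    return mean_x / n, sd_x / n

# Illustrative values (assumptions).
mean_phat, sd_phat = proportion_mean_sd(100, 0.6)

# The closed form stated in the lecture: sqrt(p(1 - p) / n).
direct = math.sqrt(0.6 * (1 - 0.6) / 100)

print(mean_phat, sd_phat, direct)
```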
00:09:07.080 --> 00:09:17.470
So, the conclusion is this. Suppose there is
a proportion in the population out there and
00:09:17.470 --> 00:09:25.210
I am trying to find it using a sampling process.
I am drawing samples, n samples each time.
00:09:25.210 --> 00:09:31.280
If I repeat the process again and again, then
every time I will measure a different value
00:09:31.280 --> 00:09:38.170
of the measured proportion p̂, and that
will have a distribution.
00:09:38.170 --> 00:09:45.200
And that distribution will have its mean at
the population proportion p, and
00:09:45.200 --> 00:09:55.780
its standard
deviation will be this, a well-defined number.
00:09:55.780 --> 00:10:03.950
Well you might ask: what is the guarantee
that this distribution will be a normal distribution?
00:10:03.950 --> 00:10:15.140
Earlier, our argument for relying on a normal
distribution was the central limit theorem.
00:10:15.140 --> 00:10:19.520
But here we cannot simply invoke the central limit
theorem, because it is a different problem
00:10:19.520 --> 00:10:26.770
we are dealing with.
So, how do we ensure that it will be a normal
00:10:26.770 --> 00:10:32.140
distribution?
Now, the question is
00:10:32.140 --> 00:10:39.010
legitimate.
The answer goes by a
00:10:39.010 --> 00:10:44.550
rule of thumb.
It is not difficult to see that, if the
00:10:44.550 --> 00:10:50.940
number of 1’s in the population is too small,
like a dwindling population,
00:10:50.940 --> 00:10:58.840
something close to extinction,
then the distribution will not be a
00:10:58.840 --> 00:11:02.750
normal distribution.
So, there are situations where it will not
00:11:02.750 --> 00:11:07.110
be a normal distribution.
Therefore, we need to talk about under what
00:11:07.110 --> 00:11:12.350
condition we can expect more or less a normal
distribution.
00:11:12.350 --> 00:11:19.910
And there is a rule of thumb for that.
Note that this is only a rule of thumb.
00:11:19.910 --> 00:11:27.690
It is not a theorem.
People have simply tried it out and found that
00:11:27.690 --> 00:11:39.850
it works.
That is, if the probability p (and p is
00:11:39.850 --> 00:11:55.750
the probability of getting a 1) times
the number of
00:11:55.750 --> 00:12:11.230
samples n is greater than 10, and if the
probability of the other outcome, that is,
00:12:11.230 --> 00:12:32.300
(1 − p), times the number of samples
is also greater than 10, then the
00:12:32.300 --> 00:12:45.790
distribution
of the measured p̂ will be approximately
00:12:45.790 --> 00:13:00.070
normal.
If it is normal, then we can apply all that
00:13:00.070 --> 00:13:04.730
we have learnt about the normal distribution.
That is a major advantage.
00:13:04.730 --> 00:13:09.250
But the caveat is that it is not applicable
to all cases.
00:13:09.250 --> 00:13:18.930
For example, in
the example that we have taken,
00:13:18.930 --> 00:13:31.810
p was 0.6,
and 1 minus
00:13:31.810 --> 00:13:43.530
p was 0.4, and suppose we take 20 samples:
n is equal to 20.
00:13:43.530 --> 00:14:00.850
Then
p into n will be 12
00:14:00.850 --> 00:14:15.630
and 1 minus p into n will be 8.
So, obviously, 8 is not bigger than 10,
00:14:15.630 --> 00:14:21.810
and therefore, in this case the normal
distribution will not apply.
00:14:21.810 --> 00:14:27.740
So, if you keep on drawing samples of 20
and expect p̂ to follow a normal
00:14:27.740 --> 00:14:32.779
distribution, that will not work.
You have to take more than 20 samples; here, more than 25.
00:14:32.779 --> 00:14:36.220
Okay?
Only then will the normal distribution work.
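The rule of thumb is easy to turn into a small check. The sketch below is only an illustration of the two conditions stated above; the function name and the `threshold` parameter are assumptions, and the rule itself remains a rule of thumb, not a theorem.

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb: both n*p and n*(1 - p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

# The case discussed above: p = 0.6 with n = 20 gives 12 and 8, so it fails.
print(normal_approx_ok(20, 0.6))

# A larger sample with the same p passes both conditions (60 and 40).
print(normal_approx_ok(100, 0.6))
```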
00:14:36.220 --> 00:14:46.779
Before we end, let us quickly do an example.
Consider the Mendel-type experiment
00:14:46.779 --> 00:15:02.660
in which there are tall plants and short
plants, and you have collected a sample of
00:15:02.660 --> 00:15:06.761
50.
And you have counted them and
00:15:06.761 --> 00:15:20.240
found that 33 are tall.
I will write that down.
00:15:20.240 --> 00:15:41.410
So, your measured proportion is 33 by
50, which is equal to 0.66.
00:15:41.410 --> 00:15:56.570
The question is, can you state that the actual
proportion out there lies in the range 0.64
00:15:56.570 --> 00:15:59.700
to 0.68?
Can you say
00:15:59.700 --> 00:16:07.540
that the actual p lies in this range?
This is the problem.
00:16:07.540 --> 00:16:11.880
Okay?
So, how do we attack this problem?
00:16:11.880 --> 00:16:21.320
We have calculated
the sample proportion to be this, and we
00:16:21.320 --> 00:16:35.420
can calculate the standard deviation of p̂:
σ of p̂ is the square root of p(1 − p) by
00:16:35.420 --> 00:16:39.290
n.
Now, p we do not know.
00:16:39.290 --> 00:16:44.310
Therefore, we substitute whatever we have
measured, which is this.
00:16:44.310 --> 00:16:55.920
So, it comes to be the square root of 0.66 times (1 − 0.66)
divided by the number of samples.
00:16:55.920 --> 00:16:59.320
We have taken 50.
Okay?
00:16:59.320 --> 00:17:06.040
And that turns out to be 0.067.
Okay?
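The arithmetic for this example can be checked in a few lines. This is a minimal sketch using the numbers from the lecture (33 tall plants out of 50); the variable names are assumptions, and p̂ is substituted for the unknown p exactly as described above.

```python
import math

x, n = 33, 50          # tall plants counted, sample size
p_hat = x / n          # measured proportion

# Standard deviation of p-hat, with p-hat substituted for the unknown p.
sd_phat = math.sqrt(p_hat * (1 - p_hat) / n)

print(round(p_hat, 2), round(sd_phat, 3))
```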
00:17:06.040 --> 00:17:14.890
Now, will this distribution be a normal distribution?
Let’s quickly check: p is 0.66.
00:17:14.890 --> 00:17:29.669
So, p into n is 0.66 times 50, which is 33, and
1 minus p times n is equal to 17.
00:17:29.669 --> 00:17:33.249
So, both are bigger than 10.
So, normal!
00:17:33.249 --> 00:17:34.249
Okay?
Distribution is normal.
00:17:34.249 --> 00:17:42.429
That is good.
So, if the distribution is normal, then we
00:17:42.429 --> 00:17:56.049
can picture the distribution as something
like this.
00:17:56.049 --> 00:18:04.869
And we know the mean will be at p, and the
standard deviation we have just calculated.
00:18:04.869 --> 00:18:09.869
Okay?
The question was the probability of this.
00:18:09.869 --> 00:18:30.249
So, the probability that the actual proportion p
(not p̂, but p) is between 0.64 and 0.68,
00:18:30.249 --> 00:18:44.470
that is, within 0.02 of the measured 0.66: this is what
we are trying to calculate.
00:18:44.470 --> 00:18:52.890
So, the range is 0.66 minus 0.02 to 0.66 plus 0.02.
So, our statement is: what is the probability
00:18:52.890 --> 00:18:56.600
that this is true?
So, this is again... sorry...
00:18:56.600 --> 00:19:05.110
I will write this p as capital P because it
is a probability.
00:19:05.110 --> 00:19:13.690
Again, like in the earlier
problems, we will reverse the argument.
00:19:13.690 --> 00:19:27.210
We will say p̂ is within 0.02 of the mean
p, since the distance between this and that is
00:19:27.210 --> 00:19:31.419
the same as the distance between this and this.
So, we have the same argument.
00:19:31.419 --> 00:19:39.200
And then we have to express that
as a multiple of the standard deviation.
00:19:39.200 --> 00:19:40.919
Then we get the value of z.
Okay?
00:19:40.919 --> 00:19:50.259
So, the z value is the multiplier:
0.02
00:19:50.259 --> 00:20:12.940
divided
by the standard deviation, 0.067, and that turns out to
00:20:12.940 --> 00:20:14.169
be about 0.3.
z turns out to be 0.3.
00:20:14.169 --> 00:20:21.330
So, we are essentially saying that p̂ is
within 0.3 standard deviations of the mean.
00:20:21.330 --> 00:20:27.510
Now, consult the z table.
Suppose 0.3 is somewhere
00:20:27.510 --> 00:20:38.940
here; 0.3 will be
somewhere here, maybe.
00:20:38.940 --> 00:20:56.869
So, if you look up the area to the left of
z equal to 0.3, you will find that
00:20:56.869 --> 00:21:18.450
the area is 0.6179.
From this area we will subtract 0.5,
00:21:18.450 --> 00:21:27.820
the left half.
So, 0.6179 minus 0.5 gives only
00:21:27.820 --> 00:21:39.739
this area, and we take it times two.
That turns out to be 0.2358.
00:21:39.739 --> 00:21:52.239
So, the probability that the
actual proportion in the population
00:21:52.239 --> 00:21:59.590
out there lies in this range
is only this much.
00:21:59.590 --> 00:22:06.759
Which means that I can state with only about 23.6
percent confidence that the population proportion will
00:22:06.759 --> 00:22:10.720
be in this range.
00:22:10.720 --> 00:22:17.679
It is a very low confidence.
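Instead of the z table, the same area can be computed from the normal cumulative distribution, written here with the standard error function. This is a sketch under the normal approximation; the function names are assumptions. It prints the result both with the unrounded z and with z rounded to 0.3 as in the table lookup, so the two will differ slightly.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def central_area(z):
    """Area between -z and +z: (left area minus 0.5), times two."""
    return 2 * (normal_cdf(z) - 0.5)

half_width = 0.02      # range is 0.66 plus or minus 0.02
sd_phat = 0.067        # standard deviation of p-hat from the example

z = half_width / sd_phat
prob = central_area(z)                  # using the unrounded z
prob_rounded = central_area(0.3)        # using z = 0.3, as read from the table

print(round(z, 3), round(prob, 4), round(prob_rounded, 4))
```

Either way the confidence comes out at a little under a quarter, which is the point of the example.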
That immediately tells us that we need to
00:22:17.679 --> 00:22:21.059
change our strategy.
We need to do something else in order to increase
00:22:21.059 --> 00:22:26.279
the level of confidence for proportion measurements.
I will stop here and continue in the next
00:22:26.279 --> 00:22:26.609
class.