Why Does Sample Standard Deviation Use n-1 in Its Calculation?


Introduction

When a whole population (dataset) has to be analyzed statistically, smaller samples are drawn from it using different sampling techniques to make the analysis easier.

An important term with respect to variability is ‘Standard Deviation’. Generally, a low value of Standard Deviation indicates lower dispersion among the data values, meaning the values lie close to their mean.

Normally the population mean (µ) is not known, and therefore the population standard deviation (σ) is not known either.

So we estimate the population mean (µ) by the sample mean (x̄) and the population standard deviation (σ) by the sample standard deviation (S).

Now let's look at the formulas for Standard Deviation.
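For reference, the standard textbook forms of the two formulas can be written as follows. The symbol x_i for an individual data value is an assumption here; the other symbols are the ones used in this article.

```latex
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
\qquad
S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```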

The notable difference between the two formulas is in the denominator. For a population, the whole size (N) is used; for a sample, the size reduced by one (n-1) is used when calculating the standard deviation.

If the whole point of analyzing a sample is to represent the population, then why is there a difference?

There are explanations for it, both empirical and theoretical.

Calculations

---> Let's understand this with an example.

Here we choose the first 10 natural numbers as our population. The standard deviation of the population (1,2,3,4,5,6,7,8,9,10) works out to σ = 2.87.

---> Now let's take some samples from our dataset and calculate the standard deviation (S) for them.
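As a quick check of that figure, here is a minimal Python sketch. The variable names are illustrative, and it assumes the standard library's `statistics.pstdev`, which divides by the population size N.

```python
import statistics

# The population from the example: the first 10 natural numbers.
population = list(range(1, 11))

# pstdev uses the population formula (divides the squared deviations by N).
sigma = statistics.pstdev(population)
print(round(sigma, 2))
```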

Sample	Sample size (n)	S using divisor n	S using divisor (n-1)	S using divisor (n-2)
1,2,3,4,5	5	1.41	1.58	1.83
6,7,8,9	4	1.12	1.29	1.58
6,8,10	3	1.63	2.00	2.83

The values are calculated and compared for the different cases. Compared with dividing by n, dividing by (n-1) or (n-2) gives a larger value, which brings the results for these samples closer to the population standard deviation (σ = 2.87).
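The three columns can be recomputed with a short Python sketch. The helper name `std_with_divisor` is hypothetical and introduced only for illustration; it takes the divisor as an explicit argument so that n, n-1 and n-2 can be compared side by side.

```python
import math

def std_with_divisor(sample, divisor):
    """Square root of (sum of squared deviations from the sample mean) / divisor."""
    mean = sum(sample) / len(sample)
    squared_deviations = sum((x - mean) ** 2 for x in sample)
    return math.sqrt(squared_deviations / divisor)

# The three samples used in the comparison.
samples = [(1, 2, 3, 4, 5), (6, 7, 8, 9), (6, 8, 10)]

for sample in samples:
    n = len(sample)
    # Divisors n, n-1 and n-2, each result rounded to two decimals.
    row = [round(std_with_divisor(sample, n - k), 2) for k in (0, 1, 2)]
    print(sample, n, row)
```

Passing the divisor explicitly keeps the only difference between the three columns visible in one place.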

---> The theory behind it goes like this:

We use the sample mean (x̄) and the sample standard deviation (S) to represent the population mean (µ) and the population standard deviation (σ) respectively.

The sample mean (x̄) is computed from the sample itself, so it always sits at the center of the sampled values.

The population mean (µ), on the other hand, can be anywhere. It may or may not coincide with the center of the sample, because the sample covers only part of the population's range.

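Each value is subtracted from the mean to get the average distance, and this is where the numerator goes wrong. The sum of squared deviations measured from x̄ is never larger, and usually smaller, than the same sum measured from µ. So the numerator for the sample underestimates the spread around the population mean.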
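The gap between the two numerators can be made explicit with a standard identity. Here x_i is an assumed symbol for an individual sample value, and the cross term vanishes because deviations from the sample mean sum to zero.

```latex
\sum_{i=1}^{n} (x_i - \mu)^2
  = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2
  \ge \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The extra term n(x̄ - µ)^2 is zero only when the sample mean happens to equal the population mean.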

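Now, the sample should be the best possible representation of the whole dataset (population). This will not happen if the formula systematically underestimates the spread. To compensate for the smaller numerator, the denominator must also be decreased.

This is done by subtracting from the denominator. Subtracting exactly one, giving (n-1), is the standard choice: for independent random draws, it makes the sample variance (S squared) equal the population variance on average. The (n-2) case is included here only for comparison.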
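One way to see the effect of the divisor is to average over every possible sample instead of a few hand-picked ones. The sketch below is illustrative only: the function name `average_variance` is hypothetical, and it assumes samples of size 3 drawn with replacement from the population 1 to 10. It compares variances (squared standard deviations), not standard deviations themselves.

```python
from itertools import product

population = range(1, 11)
N = len(population)
mu = sum(population) / N
# Population variance: squared deviations from mu, divided by N.
sigma_sq = sum((x - mu) ** 2 for x in population) / N

def average_variance(n, divisor):
    """Average of (sum of squared deviations from the sample mean) / divisor
    over every ordered sample of size n drawn with replacement."""
    total = 0.0
    count = 0
    for sample in product(population, repeat=n):
        mean = sum(sample) / n
        total += sum((x - mean) ** 2 for x in sample) / divisor
        count += 1
    return total / count

n = 3
for divisor in (n, n - 1):
    print(divisor, round(average_variance(n, divisor), 4), sigma_sq)
```

Under these assumptions, the average with divisor n falls short of the population variance, while the average with divisor (n-1) matches it.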

* The concept of ‘Degrees of Freedom’ *

The degrees of freedom (DOF) tell you how many values in a calculation can be chosen independently. More DOF means more values are free to vary. As you fix the values one by one, the remaining DOF decrease accordingly.

Let's say we choose the sample (6,7,8,9), so the mean is 7.5. We calculate (6-7.5)^2, (7-7.5)^2 and then (8-7.5)^2. Deviations from the sample mean always sum to zero, and the first three deviations (-1.5, -0.5, +0.5) sum to -1.5, so the last deviation must be +1.5. With the mean fixed at 7.5, the only allowable last value is therefore 9. Because the last value cannot be chosen freely, the DOF is reduced by one. Hence, using (n-1) is advised.
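The constraint on the last value can be sketched in a few lines of Python. The variable names are illustrative; the only fact relied on is that deviations from the sample mean sum to zero.

```python
sample_mean = 7.5
known_values = [6, 7, 8]  # the first three values of the sample

# Deviations from the sample mean must sum to zero,
# so the fourth deviation is forced by the first three.
known_deviations = [x - sample_mean for x in known_values]
last_deviation = -sum(known_deviations)
last_value = sample_mean + last_deviation
print(last_value)
```

Only three of the four deviations are free to vary, which is the (n-1) that appears in the denominator.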

Important thing to remember!

*The sampling technique (random sampling, cluster sampling, etc.), the size of the sample and the skewness of the data also matter when measuring the standard deviation.

Harsh Mehta
Dec, 27 2022
