Data Science

The search for missing values with the help of mean, median and mode!

Did you know that a 2016 survey of about 80 data scientists, conducted for the second year in a row by CrowdFlower (a provider of a “data enrichment” platform for data scientists), found that they spend 60% of their time cleaning and organising data?

Photo credit : CrowdFlower Data Science Report 2016

During data cleaning we very often come across a value called “NaN” (Not a Number), which is one of the common ways to represent a missing value in the data. In the real world, it is nearly impossible to get a dataset without a single missing value. Three major reasons for missing values are:

Human error while recording the data.

Poor maintenance of datasets leading to data corruption and missing values.

Data intentionally not provided by the user.

Today I’ll give you a simple logic for using the basic tools of central tendency to handle missing values effectively and fearlessly from now on. I won’t rely on any single tool: you can use Excel, Python, R, SQL or whichever platform you are comfortable with, as I’ll show you how to think mathematically. So let’s begin...

The centre (mean, median and mode) of a bunch of data points is usually a good summary of the kind of data we can expect from the group as a whole. In other words, the centre tells us the overall story of the data points in a nutshell. For instance, if someone asks you “How many marks did you score in your school days?”, you won’t list your marks starting from 1st grade; instead you will give him/her a centre value that represents your performance over the years in school.

Mean (Average)

It takes the sum of all data points in a dataset and divides it by the number of data points.

For example, if 5 dogs produce a total of 50 puppies, then we can easily conclude that each parent has 10 puppies on average.

Another example: if Jack has $10 and Jill has $20, then the mean amount they each have is $15. But this does not mean that they can both purchase a Marvel comic costing $15, because in reality Jack only has $10.

Mean tells us something about the data as a whole, but does not tell us about the individual data points involved in the dataset.

We can think of the mean as the balancing point, i.e., total distance below the mean is equal to the total distance above the mean.

For example let's consider the data set {2,3,6,9}

Mean = (2+3+6+9)/4 = 5

And we can observe that the total distance below the mean is equal to the total distance above the mean: (5-2)+(5-3) = 3+2 = 5 below, and (6-5)+(9-5) = 1+4 = 5 above.
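The balancing-point idea can be checked with a few lines of Python. This is only a minimal sketch using the standard `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean

data = [2, 3, 6, 9]
centre = mean(data)  # (2 + 3 + 6 + 9) / 4

# Total distance of the points lying below the mean, and of those lying above it.
below = sum(centre - x for x in data if x < centre)
above = sum(x - centre for x in data if x > centre)

print(centre, below, above)
```

Both totals come out equal, which is exactly what makes the mean a balance point.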

When to use mean for filling in missing values?

The mean is good at measuring things that are relatively “normally” distributed, i.e., a distribution that has roughly the same amount of data on either side of the middle and has its most common values around the middle of the data. A distribution tells us how often each value occurs in our dataset, i.e., its frequency.
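As a rough sketch of what filling with the mean could look like in Python: the heights below are hypothetical, roughly symmetric values invented for this illustration, and `fill_with_mean` is a helper name made up here, not part of any library.

```python
import math
from statistics import mean

def fill_with_mean(values):
    """Return a copy of values where NaN entries are replaced by the mean of the observed ones."""
    observed = [v for v in values if not math.isnan(v)]
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in values]

# Hypothetical, roughly symmetric data with two missing entries (NaN).
heights_cm = [160.0, 165.0, math.nan, 170.0, 175.0, math.nan, 180.0]
filled = fill_with_mean(heights_cm)
print(filled)
```

The missing entries are left out when computing the mean, because any arithmetic that includes NaN would itself return NaN.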

When not to use mean?

As we discussed, the mean can be considered the balancing point, so if outliers are present or the data is skewed, the mean will be pulled towards those unusual, low-frequency values to maintain the balance.

For example,

Case 1: 10 people in a cafeteria each have an income of $20,000:

Mean income: $20,000

Case 2: Now Elon Musk, who has an income of $100,000, enters the cafeteria:

Mean income: (10 × $20,000 + $100,000) / 11 ≈ $27,273

We can see that even though the majority of people have an income of $20,000, the mean still shifted toward the outlier to maintain the balance point, giving us the wrong intuition. This is a major analysis error, and it is how we fall into the trap of politicians portraying the average income of people as being on the rise when actually the rich are getting richer.

So how do we tackle this?

For this I first have to introduce another type of statistic called the median.
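The shift in the mean can be reproduced with a short Python sketch. It uses the cafeteria figures from the example; the variable names are just illustrative.

```python
from statistics import mean

# Case 1: ten people, each with an income of $20,000.
incomes = [20_000] * 10
mean_before = mean(incomes)

# Case 2: one person with an income of $100,000 joins them.
incomes_with_outlier = incomes + [100_000]
mean_after = mean(incomes_with_outlier)  # 300,000 / 11

print(mean_before, round(mean_after, 2))
```

A single unusual value out of eleven is enough to move the mean by more than $7,000.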

Median

It is the middle number when we line up our data points from smallest to largest, i.e., the midpoint when the data is arranged in ascending (or descending) order. When the number of data points is even, it is the average of the two middle values. It’s the physical middle point, not a balance point as in the case of the mean. It does not consider the value of every data point, just the middle.

For example, the median of {1,4,7,2,6} is 4 because when the numbers are put in order {1,2,4,6,7}, the number 4 is middle.

When to use median to fill missing values?

It is mostly used when outliers are present in the data. Please note that the presence of outliers does not mean the data is bad; we need to think about whether the unusual data points belong in our dataset or not.

As in the previous cafeteria example,

Case 1: Median Income: $20,000

Case 2: Median Income: $20,000

So we can see that even though the mean was distorted by the presence of an outlier, the median remained the same, giving us a better picture of the income distribution of the people present in the cafeteria.
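Here is a small sketch of filling a missing value with the median in Python. It reuses the cafeteria incomes and assumes, purely for illustration, that one extra person's income was not recorded (stored as NaN).

```python
import math
from statistics import median

# Ten incomes of $20,000, the $100,000 outlier, and one hypothetical missing entry.
incomes = [20_000.0] * 10 + [100_000.0, math.nan]

# Compute the median over the observed values only.
observed = [x for x in incomes if not math.isnan(x)]
fill = median(observed)

filled = [fill if math.isnan(x) else x for x in incomes]
print(fill, filled[-1])
```

The filled-in value matches what most people in the cafeteria actually earn, which the mean of the same data would not.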

When not to use either mean or median?

Neither mean nor median should be used when dealing with qualitative or categorical data.

For example, 10 people were asked to rate a book on a scale between 1 and 5 on Amazon, {1,1,1,2,3,5,5,5,5,5}

Outcome:

Rating    Number of people
1 ⭐       3
2 ⭐       1
3 ⭐       1
5 ⭐       5

Now we can see that on average a person gave the book 3.3 ⭐, i.e., Mean = (1+1+1+2+3+5+5+5+5+5)/10 = 3.3 ⭐. But this is not at all practical, as no one gave a 3.3 ⭐ rating. And even if we treat it as approximately 3 ⭐ and use it in place of a missing value, it will lead to the wrong intuition, as only 1 person gave 3 ⭐.

Similarly, Median = (3+5)/2 = 4 ⭐, and it’s evident that this is also not a realistic value, as no one gave 4 ⭐ as a rating.

So how do we tackle this?

This can be overcome by using another measure of central tendency called the mode.
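A quick Python check of these two numbers, as a sketch using the ten ratings from the example. The last two lines test whether anyone actually gave the computed value as a rating.

```python
from statistics import mean, median

ratings = [1, 1, 1, 2, 3, 5, 5, 5, 5, 5]

avg = mean(ratings)    # 33 / 10
mid = median(ratings)  # average of the 5th and 6th sorted values, 3 and 5

# Was either value ever given as a rating?
avg_was_given = avg in ratings
mid_was_given = mid in ratings

print(avg, mid, avg_was_given, mid_was_given)
```

Both checks come out false: neither the mean nor the median corresponds to a rating that a reviewer really chose.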

Mode

The word comes from the Latin word modus, which means “manner, fashion or style”. The mode tells us which data point appears most often in our dataset.

For the above example of book ratings,

Mode = 5 ⭐

We can easily see why the mode is better when dealing with qualitative data, as here the data type is both ordinal and discrete.
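As a minimal sketch of filling with the mode in Python: the ten ratings are the ones from the example, while the eleventh, missing rating (stored as `None`) is a hypothetical addition for illustration.

```python
from statistics import mode

# The ten book ratings plus one hypothetical missing rating (None).
ratings = [1, 1, 1, 2, 3, 5, 5, 5, 5, 5, None]

# The mode is computed over the observed ratings only.
observed = [r for r in ratings if r is not None]
most_common = mode(observed)

filled = [most_common if r is None else r for r in ratings]
print(most_common, filled)
```

Unlike 3.3 or 4, the filled-in value is a rating that people actually gave.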

Bimodal data is an example of “multimodal” data, which has several values that are similarly common. Multimodal data usually results from 2 or more underlying groups all being measured together.

For example, suppose we are given the task of analysing customer footfall (traffic) in a restaurant and the times at which it occurs.

If we conclude from such data that the mean or average time people visit the restaurant is 15:30, it would be a wrong conclusion, as very few people visit at that time. Instead, the mode is a better choice for the analysis, as it gives us the two peak times we are after for the restaurant, i.e., 12:00 and 18:00, which we can then share with the stakeholders so they can be better prepared and provide better service to their customers.
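To illustrate the idea in Python, here is a sketch with made-up data. The visit hours below are hypothetical and are not the restaurant's actual footfall, so the mean they produce differs from the 15:30 quoted above. `statistics.multimode` (available from Python 3.8) returns every value that shares the highest frequency.

```python
from statistics import mean, multimode

# Hypothetical visit times recorded as the hour of the day (24-hour clock).
visit_hours = [11, 12, 12, 12, 12, 13, 15, 17, 18, 18, 18, 18, 19]

average_hour = mean(visit_hours)   # a mid-afternoon hour between the two rushes
peak_hours = multimode(visit_hours)  # every hour tied for the highest count

visits_at_average = visit_hours.count(average_hour)
print(average_hour, peak_hours, visits_at_average)
```

In this invented data the mean lands on an hour with a single visit, while the modes recover the lunch and dinner peaks.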

Conclusion

Statistics can be true and deceptive at the same time.

Understanding the question you are trying to answer is critical.

Is the data relevant, and does it answer that question or not?

Analyse and then decide what measure of central tendency (mean, median or mode) is needed here.

taral desai
Apr 26, 2022