Data Science

Basic Statistics for Data Science

Why Statistics?

Almost everyone asks this question before learning statistics. Statistics is the practice of collecting, describing, and analysing quantitative data and drawing conclusions from it. It helps a great deal in data analysis and data science, where statistical principles are used to analyse data and make predictions from it.

The two major areas of statistics are descriptive statistics and inferential statistics:

1) Descriptive statistics

2) Inferential statistics

Many people who study statistics do not understand its purpose. Statistics mainly helps in analysing data and drawing better conclusions with the support of some theory. Probability supports statistics as well: for example, the distribution of a random variable describes the skew of the data, and that information feeds into inferential statistics. In this blog we discuss descriptive statistics.

1) Descriptive statistics:

Descriptive statistics mostly focus on the central tendency, variability, and distribution of sample data. Central tendency is an estimate of the typical element of a sample or population, and includes descriptive statistics such as the mean, median, and mode. Variability refers to a set of statistics that show how much the elements of a sample or population differ from one another on the characteristic measured, and includes metrics such as the range, variance, and standard deviation.

Descriptive statistics are classified into two types:

a) Univariate Analysis

b) Bivariate Analysis

Univariate analysis measures the central tendency and the spread of a single variable, while bivariate analysis measures the relationship between two variables, using measures such as covariance and correlation.

Measure of Central Tendency:

A measure of central tendency describes where the data is centred. Measures of central tendency are widely used because they are simple to apply and work well on real-world examples.

There are different ways to measure the central tendency of data. The mean, median, and mode are the ones mainly used. We learn statistics here to understand the data in a dataset, to visualise it, and to make predictions that address the user's question. Mainly, we learn statistics for data analysis. Comparing measures of central tendency also helps a lot in spotting outliers.

What is an outlier?

Data analysts and data scientists frequently use the word outlier. An outlier is a data point that lies far from the rest of the data. The farther a point lies from the others, the more extreme the outlier. Measures of central tendency help reveal whether outliers are present. If the outliers are large, they distort the data: the dataset will not visualise well and will not be good for predictions.

The three most commonly used measures of central tendency are:

1) Mean

2) Median

3) Mode

1) Mean:

The mean is the sum of the observations divided by the total number of observations.

Mean = {Sum of Observations} ÷ {Total Number of Observations}

Example:

An athlete practises running on a 100-metre track, and the sample times are 20,40,30,33,22,54,25,26,77,88,99,200,45,55,22,54,66. Each sample is the athlete's running time in seconds. These are random samples chosen to give a better understanding of the mean. We all know the formula for the mean, but we often do not understand it in depth. In this blog we look at it more closely using these samples. The sum of the samples is 956 and the total number of samples is 17, so the mean of the sample data is 956 ÷ 17, which is about 56.24. Observe that the data mostly lies between 20 and 100, but one sample is 200. That single sample shifts the mean of the whole sample. 200 is the biggest outlier in these samples.

A data analyst or data scientist must understand the data in depth. Checking for outliers is important and leads to more robust results. The mean is not a good measure when outliers are present. When there are none, we can take the mean of the sample. No single method, the mean or any other, is always the right choice. Analyse the data first, then decide which method fits it. Here the athlete takes 200 seconds to complete 100 metres. That value is an outlier and it shifts the result: after removing 200, the mean is 756 ÷ 16 = 47.25. With 200 the mean is about 56.24, and without it the mean is 47.25. The difference between the two means is about 9 seconds, which is a large impact from a single sample.

The mean is distorted when outliers are present. Because the mean uses the total sum of the observations, outliers are counted in that sum, and an outlier that differs greatly from the rest changes the mean drastically. That is why the mean is not the main choice for prediction when outliers are present: it will not give good results.
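The effect of the outlier on the mean can be checked with a few lines of Python. This is a minimal sketch using only built-in functions. The variable names (`times`, `trimmed`, and so on) are illustrative choices and not part of the original example.

```python
# Running times in seconds from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]

# Mean = sum of observations / total number of observations.
mean_with_outlier = sum(times) / len(times)

# Drop the 200-second run and recompute the mean.
trimmed = [t for t in times if t != 200]
mean_without_outlier = sum(trimmed) / len(trimmed)

print(sum(times), len(times))          # 956 17
print(round(mean_with_outlier, 2))     # 56.24
print(round(mean_without_outlier, 2))  # 47.25
```

Removing one value out of 17 moves the mean by about 9 seconds, which shows how sensitive the mean is to a single extreme observation.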

2) Median:

The median is the value of the middle item of a set of items that has been sorted into ascending or descending order. In an odd-numbered sample of n items, the median is the value of the item that occupies the (n + 1)/2 position. In an even-numbered sample, we define the median as the mean of the values of the items occupying the n/2 and (n + 2)/2 positions (the two middle items).

Take the same sample data used for the mean: 20,40,30,33,22,54,25,26,77,88,99,200,45,55,22,54,66. First we need to sort the data into ascending order: 20,22,22,25,26,30,33,40,45,54,54,55,66,77,88,99,200. The total number of samples is 17, so the median is at position (17+1)/2 = 9. The 9th value is 45, so the median is 45. The median gives a better point estimate than the mean here: the same outlier is present in both calculations, but the mean is about 56.24 while the median is 45. This comparison shows that the median is a better choice than the mean when an outlier is present. The median is not immune, though: if a large share of the samples are outliers, the median is distorted as well.
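The position rule for the median can be sketched in Python as follows. The manual steps mirror the (n + 1)/2 rule for an odd-numbered sample, and `statistics.median` from the standard library is used only as a cross-check. The variable names are illustrative.

```python
import statistics

# Running times in seconds from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]

ordered = sorted(times)
n = len(ordered)

# n is odd here, so the median is the item at position (n + 1) / 2,
# counting positions from 1.
middle_position = (n + 1) // 2
manual_median = ordered[middle_position - 1]

print(middle_position)           # 9
print(manual_median)             # 45
print(statistics.median(times))  # 45
```

For an even-numbered sample, `statistics.median` averages the two middle items, which matches the definition given above.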

3) Mode:

A distribution with a single most frequently occurring value is called unimodal. A distribution with two most frequently occurring values is called bimodal, and one with three is called trimodal.

The mode is the value that occurs most frequently in the sample data. Take the dataset above in ascending order: 20,22,22,25,26,30,33,40,45,54,54,55,66,77,88,99,200. Here the most repeated values are 22 and 54, each occurring twice, so the sample is bimodal with modes 22 and 54. The mode is often unaffected by outliers: it depends only on the most repeated values, so an extreme value that occurs once is not considered. The mode is more useful for categorical (character) data than for numerical data.
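Counting occurrences by hand is error-prone, so it can help to let Python do it. The sketch below uses `collections.Counter` and `statistics.multimode` (available from Python 3.8) on the 17 running times exactly as they were first listed. The variable names are illustrative.

```python
from collections import Counter
import statistics

# Running times in seconds, as first listed in the mean example.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]

# Count how often each value occurs.
counts = Counter(times)
print(counts[22], counts[54], counts[200])  # 2 2 1

# multimode returns every value that shares the highest count.
modes = statistics.multimode(times)
print(modes)  # [22, 54]
```

Two values share the highest count, so this sample is bimodal. The outlier 200 occurs only once and plays no part in the result.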

Measure of Spread:

Measures of spread include the range, quartiles and the interquartile range, variance, and standard deviation. Here we examine the most common measures of dispersion: range, variance, standard deviation, and interquartile range.

Range:

The range is the difference between the maximum and minimum values in a dataset.

Range = Maximum value − Minimum value.

Take the same samples in ascending order: 20,22,22,25,26,30,33,40,45,54,54,55,66,77,88,99,200. The minimum value is 20 and the maximum value is 200, so the range is 200 − 20 = 180. The range is not a robust measure: outliers always sit at the first or last position of the sorted data, so they determine the range directly.
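A short Python sketch makes the sensitivity of the range concrete. It computes the range of the running-time sample with and without the 200-second value. The variable names are illustrative.

```python
# Running times in seconds from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]

# Range = maximum value - minimum value.
full_range = max(times) - min(times)

# The same calculation after dropping the 200-second outlier.
trimmed = [t for t in times if t != 200]
trimmed_range = max(trimmed) - min(trimmed)

print(full_range)     # 180
print(trimmed_range)  # 79
```

A single observation more than doubles the range, which is why the range alone says little about how the bulk of the data is spread.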

Variance:

The variance is the average of the squared deviations of the data points from the mean.

Variance = {Sum of (Data point − Mean)²} ÷ {Number of samples}

The figure above shows the calculation of the variance. To calculate it, take the difference between each sample and the mean, square each difference, add the squared differences together, and divide the total by the number of samples. Sorting the data first is not required. For the running-time sample, dividing by the number of samples gives a variance of about 1835.83 with the outlier and about 578.06 without it. The variance describes the spread of the samples. The standard deviation gives a clearer picture of that spread.
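The variance steps can be written out directly in Python. This is a sketch applied to the running-time sample as listed in the mean example. It divides by the number of samples, as in the formula above, which is the population variance. The helper name `population_variance` is a hypothetical name chosen for illustration, and `statistics.pvariance` from the standard library serves as a cross-check.

```python
import statistics

# Running times in seconds from the mean example.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]


def population_variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / len(values)


trimmed = [t for t in times if t != 200]

print(round(population_variance(times), 2))    # 1835.83
print(round(population_variance(trimmed), 2))  # 578.06

# The standard library gives the same population variance.
print(round(statistics.pvariance(times), 2))   # 1835.83
```

Note that `statistics.variance` divides by n − 1 instead of n (the sample variance), so it returns a slightly larger value for the same data.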

Standard Deviation:

The square root of the variance is called the standard deviation (SD). If we already have the variance, why do we need the SD? The variance is expressed in squared units, so its values look large compared with the data. Taking the square root brings the measure back to the original units of the data, which makes the spread easier to interpret.

SD = √{Variance}
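As a sketch, the standard deviation of the running-time sample can be computed by taking the square root of the population variance. The code uses `statistics.pvariance` and `statistics.pstdev` from the standard library, both of which divide by the number of samples. The variable names are illustrative.

```python
import math
import statistics

# Running times in seconds from the mean example.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]
trimmed = [t for t in times if t != 200]

# SD = square root of the variance.
sd_with_outlier = math.sqrt(statistics.pvariance(times))
sd_without_outlier = math.sqrt(statistics.pvariance(trimmed))

print(round(sd_with_outlier, 2))     # 42.85
print(round(sd_without_outlier, 2))  # 24.04

# pstdev performs the same calculation in one call.
print(round(statistics.pstdev(times), 2))  # 42.85
```

The variance here is in seconds squared, while the standard deviation is in seconds, so it can be compared directly with the running times.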

IQR (Interquartile Range):

The interquartile range is a measure of dispersion in statistics. It describes the spread of the middle half of the data, centred around the median. The difference between the upper quartile and the lower quartile is called the interquartile range.

Quartiles divide the sorted data into four parts at the 25th, 50th, and 75th percentiles. The IQR is visualised with a box plot and is mainly used for finding outliers in the data. The box in a box plot spans the IQR, and the lines extending above and below the box are called whiskers.

The figure above shows a box plot of the salary data from the AMCAT dataset. The box shows the IQR, and the lines above and below the box are the whiskers. Above the upper whisker we can see black diamonds, which are the outliers in the data. In this data the salary column has very large outliers. The black horizontal line in the middle of the box is the median of the data. The median also helps in recognising outliers in the box plot.
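The quartiles and the IQR can also be computed without drawing a plot. The sketch below uses the running-time sample from the earlier sections, not the AMCAT salary data. Two assumptions are worth stating. First, quartile conventions differ between tools, and this code uses the default "exclusive" method of `statistics.quantiles` (Python 3.8+). Second, the 1.5 × IQR fences are the common box plot convention for flagging outliers and are assumed here rather than taken from the original article.

```python
import statistics

# Running times in seconds from the mean example.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]

# Cut points at the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(times, n=4)

# IQR = upper quartile - lower quartile.
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in times if t < lower_fence or t > upper_fence]

print(q1, q2, q3)                # 25.5 45.0 71.5
print(iqr)                       # 46.0
print(lower_fence, upper_fence)  # -43.5 140.5
print(outliers)                  # [200]
```

Only the 200-second run falls outside the fences, which agrees with the outlier identified by eye in the mean example.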

HARSHAVARDHAN REDDY PEDDIREDDY
Mar, 08 2022