Basic Statistics for Data Science
Why Statistics?
This question crosses everyone's mind before learning
statistics. Statistics is the collection, description, analysis, and
inference of conclusions from quantitative data. It helps a
lot in data analysis and data science, where statistical
principles are used to analyse data and make predictions.
The two major areas of statistics are descriptive
statistics and inferential statistics:
1) Descriptive statistics
2) Inferential statistics
Many people who study statistics do not understand its purpose. Statistics mainly helps in analysing data and drawing better conclusions with the help of some theory. Probability also supports statistics: for example, the distribution of a random variable shows the skew of the data, and that information feeds into inferential statistics. In this blog we discuss descriptive statistics.
1) Descriptive statistics:
Descriptive statistics mostly focus on the central
tendency, variability, and distribution of sample data. Central tendency is
an estimate of the typical element of a sample or population, and includes
descriptive statistics such as mean, median, and mode. Variability refers
to a set of statistics that show how much difference there is among the
elements of a sample or population along the characteristics measured, and
includes metrics such as range, variance, and standard deviation.
Descriptive statistics are classified into two types:
a) Univariate Analysis
b) Bivariate Analysis
Univariate analysis measures the central tendency and the
spread of a single variable. Bivariate analysis measures the relationship
between two variables, using covariance and correlation.
Measure of Central Tendency:
A measure of central tendency tells us
where the data is centred. Measures of central tendency are widely
used because they are simple to apply and work on real-world examples.
There are different methods to measure
the central tendency of the data. Mean, median, and mode are the main ones.
We learn statistics here to understand the data in a data set,
to visualize it, and to make predictions according to the user's needs.
Mainly we learn statistics for data analysis. Central tendency also
helps a lot in identifying outliers.
What is an outlier?
Data analysts and data scientists frequently use the word
outlier. Outliers are data points that lie far from the rest of the data. The
farther a point is from the other data points, the more extreme the outlier. Central
tendency helps us check whether outliers are present. Large outliers distort the data set:
it will not give good visualizations and will not be good for
predictions.
The three most commonly used measures of central tendency are:
1) Mean
2) Median
3) Mode
Mean:
The mean is the sum of the observations divided by the total number of observations.
Mean = {Sum of Observations} ÷ {Total Number of Observations}
Example:
An athlete is practising on a
100-meter track. The samples below are the running times: 20,40,30,33,22,54,25,26,77,88,99,200,45,55,22,54,66.
Each sample is the athlete's
running time in seconds. We have taken random samples here to give a better understanding
of the mean. We all know the mean formula, but we may still not understand the mean
in depth. In this blog we will understand it more deeply using these samples. The
sum of the samples is 956 and the number of samples
is 17, so the mean of the sample data is 956 ÷ 17 ≈ 56.24. Observe that
most of the samples lie between 20 and 100, but one sample is 200. This single sample
shifts the mean of the whole data set; 200 is the biggest outlier in these samples.
A data analyst or data scientist
must know the data in depth. Checking the data for outliers is important and
leads to robust results. The mean is not
a good method when outliers are present. If there are no outliers, we can
take the mean of the sample. No single method, the mean or any
other, is always the right choice. Analysing the data comes first; then we can decide which method
fits the data. Here the athlete takes 200 seconds to complete 100 meters in one run. That 200 is
an outlier that changes the result: removing it gives a mean of 47.25.
With 200 we get 56.24, and without 200 we get 47.25 as
the mean. The difference between the two means is about 9, which has a large impact.
The mean is distorted when outliers
are present. Because the mean uses the sum of all observations, outliers are
counted in that sum, and an outlier that differs greatly from the rest changes the mean
drastically. That is why we do not rely mainly on the mean for
prediction when outliers are present.
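The effect of the outlier on the mean can be checked with a few lines of Python. This is a minimal sketch, not part of the original example: the list name `times` and the helper `mean` are our own, and we assume the 200-second run is the only value treated as an outlier.

```python
# Running times in seconds, copied from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_with_outlier = mean(times)
# Assumption: only the 200-second run is treated as an outlier.
mean_without_outlier = mean([t for t in times if t != 200])

print(round(mean_with_outlier, 2))     # 56.24
print(round(mean_without_outlier, 2))  # 47.25
```

One value out of 17 moves the mean by about 9 seconds, which is the sensitivity described above.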
Median:
The median is the value of the middle
item of a set of items that has been sorted into ascending or descending order.
In an odd-numbered sample of n items, the median is the value of the item that
occupies the (n + 1)/2 position. In an even-numbered sample, we define the
median as the mean of the values of the items occupying the n/2 and
(n + 2)/2 positions (the two middle items).
We take the same sample data used for the mean: 20,40,30,33,22,54,25,26,77,88,99,200,45,55,22,54,66. First we sort the data in ascending order: 20,22,22,25,26,30,33,40,45,54,54,55,66,77,88,99,200. The number of samples is 17, so the median is at position (17 + 1)/2 = 9. The 9th value is 45, so the median is 45. The median gives a better point estimate than the mean here: the same outlier is present in both calculations, but the mean is 56.24 and the median is 45. So the median is a better choice than the mean when an outlier is present. The median does not depend on how large a single outlier is, but if many of the values are outliers, the median is distorted too.
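The position rule for the median can be sketched in Python as follows. The function name `median` and the list name `times` are our own choices for illustration; Python's built-in `statistics.median` gives the same result.

```python
# Running times in seconds, copied from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]


def median(values):
    """Middle item of the sorted data, following the position rule above."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, which is index `mid` when counting from 0.
        return ordered[mid]
    # Even n: mean of the items at positions n/2 and (n + 2)/2.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(times))                            # 45
print(median([t for t in times if t != 200]))   # 42.5
```

Removing the 200-second run moves the median only from 45 to 42.5, far less than the change in the mean.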
Mode:
A distribution with a single most frequently occurring value is called
unimodal. A distribution with two most frequently occurring values is called
bimodal, and one with three is called trimodal.
The mode is the value that
occurs most frequently in the sample data. We take the data set above in ascending order:
20,22,22,25,26,30,33,40,45,54,54,55,66,77,88,99,200.
Here the most repeated values are 22 and 54. Each occurs twice, so this sample
is bimodal, with modes 22 and 54.
The mode is sometimes helpful against outliers: it depends only on
repeated values, so an isolated outlier does not affect it. The mode is more useful for
categorical (character) data than for numerical data.
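Counting how often each value occurs makes the mode easy to verify. The sketch below uses `collections.Counter` from the standard library; the helper name `modes` is our own, and it returns every value that shares the highest count so that bimodal and trimodal samples are reported in full.

```python
from collections import Counter

# Running times in seconds, copied from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]


def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(times))                     # [22, 54]
print(modes(["red", "blue", "red"]))    # ['red']
```

The second call shows the same function working on character data, where a mean or median would not be defined.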
Measure of Spread:
Measures
of spread include the range, quartiles and the interquartile range, variance,
and standard deviation. Here we examine the most common measures
of dispersion: range, variance, standard deviation, and interquartile
range.
Range:
The range is
the difference between the maximum and minimum values in a dataset.
Range = Maximum value − Minimum value
We take the
same samples in ascending order: 20,22,22,25,26,30,33,40,45,54,54,55,66,77,88,99,200.
The minimum value is 20 and the maximum value is 200, so the range is 180.
The range is not a robust measure: it depends only on the two extreme values,
and outliers always sit at the start or end of the sorted data, so a single
outlier can change the range completely.
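A short Python check shows how strongly the range reacts to the outlier. The helper name `value_range` is our own, and as before we assume the 200-second run is the only outlier.

```python
# Running times in seconds, copied from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]


def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)


print(value_range(times))                           # 180
# Assumption: only the 200-second run is treated as an outlier.
print(value_range([t for t in times if t != 200]))  # 79
```

Dropping one value cuts the range by more than half, from 180 to 79.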
Variance:
The variance measures the deviation
around the mean: it is the average of the squared differences between each data point and the mean.
Variance = {Sum of (Data point − Mean)²} ÷ {Number of Samples}
The figure above shows the variance calculation. Sorting the data
is not required. We take the
difference between each sample and the mean and square each difference.
We then add up the squared differences and divide the total by the number of samples, which gives the variance.
For the running-time sample, this gives a variance of about 1835.83
with the outlier and about 578.06 without it. The variance
describes the spread of the samples. The standard deviation gives better
clarity on the spread.
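The variance of the running-time sample can be reproduced with the sketch below. It divides by the number of samples, as in the formula above (the population variance); dividing by n − 1 instead would give the slightly larger sample variance. The names `times` and `variance` are our own.

```python
# Running times in seconds, copied from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]


def variance(values):
    """Average squared difference from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


variance_with_outlier = variance(times)
# Assumption: only the 200-second run is treated as an outlier.
variance_without_outlier = variance([t for t in times if t != 200])

print(f"{variance_with_outlier:.2f}")     # 1835.83
print(f"{variance_without_outlier:.4f}")  # 578.0625
```

The single outlier more than triples the variance, because its distance from the mean is squared.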
Standard Deviation:
The
square root of the variance is called the standard deviation (SD). We already have the variance, so why
do we need the SD? The variance is expressed in squared units, so its values are large and hard to
compare with the data. Taking the square root of the variance gives the spread in the same units as the data.
SD = √Variance
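As a quick illustration, the sketch below takes the square root of the variance of the running-time sample. It assumes the variance divides by the number of samples (population variance); the function names are our own.

```python
import math

# Running times in seconds, copied from the example above.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]


def variance(values):
    """Average squared difference from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))


sd_with_outlier = standard_deviation(times)
# Assumption: only the 200-second run is treated as an outlier.
sd_without_outlier = standard_deviation([t for t in times if t != 200])

print(round(sd_with_outlier, 2))     # 42.85
print(round(sd_without_outlier, 2))  # 24.04
```

Both results are in seconds, so they can be compared directly with the running times, which the variance cannot.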
IQR (Interquartile Range):
The interquartile range is a
measure of dispersion that describes the spread of the middle half of the data,
around the median. It is the
difference between the upper quartile and the lower quartile.
Quartiles
divide the sorted data into four parts at the 25th, 50th, and
75th percentiles. The IQR is visualized with a box plot and is mainly used for
identifying outliers in the data. The box in the plot spans the IQR,
which makes the outliers easy to see. The lines extending above and below the box are
called whiskers.
The figure above shows a box plot of
the salary data from the AMCAT dataset. The box shows the IQR, and the lines above
and below it are the whiskers. Above the upper whisker we can see black
diamonds; these are the outliers in the data. In this data the salary column has
large outliers. The black horizontal line in the middle of the box is the median of
the data, which also helps in judging the outliers in the box plot.
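The AMCAT salary data is not reproduced in this blog, so the sketch below applies the same idea to the running-time sample from the earlier sections. It makes two assumptions that are not stated above: quartiles are computed with Python's `statistics.quantiles` using its default method (other tools use slightly different quartile conventions and may give slightly different values), and a point is flagged as an outlier when it lies more than 1.5 × IQR beyond the quartiles, which is a common box plot convention.

```python
import statistics

# Running times in seconds, copied from the earlier example.
times = [20, 40, 30, 33, 22, 54, 25, 26, 77, 88, 99, 200, 45, 55, 22, 54, 66]

# Cut points at the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(times, n=4)
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in times if t < lower_fence or t > upper_fence]

print(q1, q2, q3)                # 25.5 45.0 71.5
print(iqr)                       # 46.0
print(lower_fence, upper_fence)  # -43.5 140.5
print(outliers)                  # [200]
```

Under these assumptions only the 200-second run falls outside the fences, which matches the outlier identified by eye in the mean example.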
- HARSHAVARDHAN REDDY PEDDIREDDY
- Mar, 08 2022