Data Science

Introduction to Statistics for Data Science

Introduction

Statistics is the science of conducting studies to collect, organize, summarize and analyze data and to draw conclusions from it. In short, it is about learning from data.

As a field of mathematics, statistics mainly deals with collecting information, interpreting it from a data set and drawing conclusions from it. It is used in many fields.

For example, when we watch a cricket match, we hear terms such as batting average, bowling economy and strike rate, and we see many graphs and data visualizations. These things are all part of statistics: information is analyzed and the results are presented accordingly.

We talk about statistics all the time, but do we know the science behind it?

Large cricket organizations use various statistical methods to compare players and teams and rank them accordingly. If we learn the science behind it, we can create our own rankings, compare different things and debate with hard facts.

Statistics is very important in analytics, data science, artificial intelligence (AI), machine learning and deep learning (deep neural networks). It is used to work on complex real-world problems so that data professionals such as data analysts and data scientists can analyze data and retrieve meaningful insights from it.

In simple words, stats can be used to derive meaningful insights from data by performing mathematical computations on it.

The field of statistics is divided into two parts: descriptive statistics and inferential statistics. Data comes in two types, quantitative and qualitative, and it can be either labelled or unlabelled.

Some important terms used

Population: In statistics, a population is the entire pool from which a statistical sample is drawn. For example, if we are studying the students of a college, all students in that college make up the population. A population is contrasted with a sample.

Sample: A sample is a subset of the population, that is, a set of observations drawn from it. A good sample is representative of the population.

Samples are used in research because studying the whole population is often impractical. For example, suppose we want to know the average height of boys in a college.

There may be a large number of boys, and measuring every one of them is not practical. Instead, a certain number of boys are selected as a sample and their average height is computed. If the sample is representative of the population, this average serves as an estimate of the population average.
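The idea can be sketched in a few lines of Python. This is only an illustration: the population below is simulated (2,000 synthetic heights, an assumption made for this example, not real measurements), and the names `population` and `sample` are hypothetical.

```python
import random
import statistics

# Hypothetical population: simulated heights (cm) of 2,000 boys in a college.
# These are synthetic values for illustration, not real measurements.
rng = random.Random(42)
population = [rng.gauss(170, 7) for _ in range(2000)]

# Draw a simple random sample of 100 boys without replacement.
sample = rng.sample(population, 100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean height: {population_mean:.1f} cm")
print(f"Sample mean height:     {sample_mean:.1f} cm")
```

With a random sample of this size, the two averages should usually come out close to each other, which is why measuring 100 boys can stand in for measuring all of them.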

Variable: A characteristic of each element of a population or a sample is called a variable. For example, the height of each boy is a variable.

Types of Statistics

Statistics is divided into two major categories: descriptive statistics and inferential statistics.

Descriptive statistics:

Descriptive statistics is one of the most important parts of statistics. Here we use numbers, figures or other summaries to describe a phenomenon. These summaries are themselves known as descriptive statistics.

It helps us organize and summarize data using numbers and graphs, so that we can look for patterns in the data set.

Examples include measures of central tendency, such as mean, median and mode, and measures of variability, such as range, variance and standard deviation.
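As a minimal sketch, these measures can be computed with Python's standard `statistics` module. The `marks` list is a made-up data set used only for illustration.

```python
import statistics

# Hypothetical marks of ten students (illustrative values only).
marks = [45, 67, 67, 72, 58, 90, 67, 81, 58, 75]

# Measures of central tendency
mean = statistics.mean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)

# Measures of variability
value_range = max(marks) - min(marks)
variance = statistics.variance(marks)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(marks)      # square root of the sample variance

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range)
print(f"variance: {variance:.2f}, standard deviation: {std_dev:.2f}")
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `statistics.pvariance` and `statistics.pstdev` are the counterparts for a whole population.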

Example: Reports of production, cricket batting averages, ages, ratings, marks, etc.

Inferential statistics:

Inferential statistics uses sample data to make an inference or draw a conclusion about the population. The result is a decision, estimate, prediction or generalization about a population based on a sample.

Inferential statistics is used to make inferences from the data, whereas descriptive statistics simply describes what is going on in our data.

Scenario based study:

Suppose a particular college has 1000 students. We want to find out how many of them prefer eating in the canteen and how many prefer eating in the mess. A random group of 100 students is selected, and this becomes our sample.

So, population size = 1000 college students

sample size = 100 random students selected

We can now survey this sample of 100 students, then analyze and visualize the responses.

Insights derived:

72% of students prefer eating in the canteen.
Of the students who prefer the canteen, 44.4% are from 4th year.
Of the students who prefer the canteen, 72% are from 3rd and 4th year.
1st year students are more inclined towards eating in the mess.

The statistics above describe trends within the sample. Because these insights only summarize the collected data with numbers, they belong to descriptive statistics.
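The arithmetic behind such percentages can be sketched as follows. The article does not give the raw survey counts, so the per-year counts in `survey` are hypothetical, chosen only so that they agree with the percentages reported above.

```python
# Hypothetical survey counts per year (100 students in total), chosen only
# to be consistent with the reported percentages; not the real survey data.
survey = {
    "1st year": {"canteen": 8, "mess": 14},
    "2nd year": {"canteen": 12, "mess": 8},
    "3rd year": {"canteen": 20, "mess": 4},
    "4th year": {"canteen": 32, "mess": 2},
}

total_students = sum(c["canteen"] + c["mess"] for c in survey.values())
canteen_total = sum(c["canteen"] for c in survey.values())

# Share of all surveyed students who prefer the canteen.
canteen_share = 100 * canteen_total / total_students

# Shares among canteen-preferring students only.
fourth_year_share = 100 * survey["4th year"]["canteen"] / canteen_total
senior_share = (
    100
    * (survey["3rd year"]["canteen"] + survey["4th year"]["canteen"])
    / canteen_total
)

print(f"Prefer canteen: {canteen_share:.1f}%")
print(f"4th year among canteen students: {fourth_year_share:.1f}%")
print(f"3rd and 4th year among canteen students: {senior_share:.1f}%")
```

Every number here is computed directly from the collected responses, without any claim about students outside the sample, which is what makes it descriptive.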

Now suppose we want to open a canteen or a mess in the college. From the insights above we can infer that:

3rd year and 4th year students are the main target customers for the business.
To get more sales, we can offer discounts to 1st year and 2nd year students.
A canteen is a better business option than a mess, since most of the students in the data prefer the canteen.

Here we made inferences, assumptions and estimates about the whole college on the basis of the sample data. This is the essence of inferential statistics.
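One common way to quantify how far such an inference can be trusted is a confidence interval. The sketch below is not part of the original survey analysis: it assumes a simple random sample and uses the normal approximation for a proportion, with a finite population correction, to estimate the share of all 1000 students who prefer the canteen.

```python
import math

population_size = 1000  # N, from the scenario
sample_size = 100       # n, from the scenario
p_hat = 0.72            # sample proportion preferring the canteen

# Standard error of a sample proportion (normal approximation).
standard_error = math.sqrt(p_hat * (1 - p_hat) / sample_size)

# Finite population correction: the sample is 10% of the population,
# so the uncertainty is slightly smaller than for an infinite population.
fpc = math.sqrt((population_size - sample_size) / (population_size - 1))

# 1.96 is the approximate 95% quantile of the standard normal distribution.
margin = 1.96 * standard_error * fpc
lower, upper = p_hat - margin, p_hat + margin

print(f"Estimated share preferring canteen: {p_hat:.0%}")
print(f"Approximate 95% interval: {lower:.1%} to {upper:.1%}")
```

Under these assumptions the interval runs from roughly 64% to 80%, so the conclusion that most students in the college prefer the canteen holds even after allowing for sampling error.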

Jay Charole
Mar, 11 2022