Exploratory Data Analysis in Python - EDA
Introduction : What is EDA?
The term Exploratory Data Analysis (EDA) refers to the process of discovering patterns and relationships within a dataset. EDA is about making sense of the data at hand and getting to know it, often before modeling or formal analysis. It is an important step prior to model building and can help you formulate further questions and areas for investigation.
Goals of EDA
Depending on your analysis or model-building plan, EDA can take many different forms; however, the main goals of EDA are generally:
· Uncover the data structure and determine how it is coded
· Inspect and “get to know” the data by summarizing and visualizing it
· Detect outliers, missing data, and other anomalies and decide how/whether to address these issues
· Find new avenues for analysis and further research
· Prepare for model building or analysis, including the following:
o Check assumptions
o Select features
o Choose an appropriate method
EDA techniques
Just as the goals of EDA may vary, so do the techniques used to accomplish those goals. That said, the EDA process generally involves strategies that fall into the following three categories:
· Data inspection
· Numerical summarization
· Data visualization
Data inspection : Data inspection is an important first step of any analysis. This can help illuminate potential issues or avenues for further investigation.
For example, we might use the pandas .head() method to print out the first five rows of a dataset:
print(data.head())
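Throughout these examples, data is assumed to be a pandas DataFrame that has already been loaded. A minimal sketch of that setup (the file name students.csv is hypothetical) might look like:

import pandas as pd

# Hypothetical file name; point this at whatever dataset you are exploring.
data = pd.read_csv('students.csv')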
Based on this output, we notice that hours of sleep is a quantitative variable. In order to summarize it, we’ll need to make sure it is stored as an int or float. We also notice that there is at least one instance of missing data, which appears to be stored as NaN. As a next step, we could investigate further to determine how much missing data there is and what we want to do about it.
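As a sketch of those next steps, the snippet below coerces the sleep column to a numeric type and counts missing values per column. The column name hours_sleep is an assumption, since the actual table isn’t reproduced here.

import pandas as pd

# Hypothetical column name; errors='coerce' turns unparseable entries into NaN.
data['hours_sleep'] = pd.to_numeric(data['hours_sleep'], errors='coerce')

# Count missing values in each column to see how much data is absent.
print(data.isna().sum())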
Numerical summarization : Once we’ve inspected our data and done some initial cleaning steps, numerical summaries are a great way to condense the information we have into a more reasonable amount of space. For numerical data, this allows us to get a sense of scale, spread, and central tendency. For categorical data, this gives us information about the number of categories and the frequency of each.
In pandas, we can get a quick collection of numerical summaries using the .describe() method:
data.describe(include='all')
Based on this table, we can see that there are 187 unique student names in our table, with Kevin being the most common. The average student age is 13.75 years, with students as young as 8 years and as old as 23 years.
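If we want to dig into a single categorical column beyond what .describe() reports, a frequency table is often more readable. A short sketch, assuming a hypothetical column named name:

# Number of distinct names, and how often the most common ones appear.
print(data['name'].nunique())
print(data['name'].value_counts().head())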
Data visualization : While numerical summaries are useful for condensing information, visual summaries can provide even more context and detail in a small amount of space.
There are many different types of visualizations that we might want to create as part of EDA. For example, histograms allow us to inspect the distribution of a quantitative feature, providing information about central tendency, spread, and shape (e.g., skew or multimodality). The histogram below shows the distribution of the number of characters in each student name. We see that the average name is about 5-6 characters long, with names of up to 10 characters.
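A histogram like the one described above could be produced with matplotlib along these lines; the name column is again an assumption about how the data is stored:

import matplotlib.pyplot as plt

# Length (number of characters) of each student name.
name_lengths = data['name'].str.len()

plt.hist(name_lengths, bins=10)  # distribution of name lengths
plt.xlabel('Characters in name')
plt.ylabel('Number of students')
plt.show()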
Other kinds of visualizations are useful for investigating relationships between multiple features. For example, the scatterplot below shows the relationship between hours spent studying and hours spent sleeping.
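A comparable sketch for the scatterplot, assuming hypothetical hours_study and hours_sleep columns:

import matplotlib.pyplot as plt

# Each point is one student: studying hours vs. sleeping hours.
plt.scatter(data['hours_study'], data['hours_sleep'])
plt.xlabel('Hours spent studying')
plt.ylabel('Hours spent sleeping')
plt.show()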
Though EDA is commonly performed at the start of a project, before any analysis or model building, you may find yourself revisiting EDA again and again. It is quite common for more questions and problems to emerge during an analysis (or even during EDA itself!). EDA is also a great tool for tuning a predictive model to improve its accuracy. It is therefore useful to think of EDA as a cycle rather than a linear process in a data science workflow.
Conclusion :
EDA is a crucial step before diving into any data project because it informs data cleaning, can illuminate new research questions, is helpful in choosing appropriate analysis and modeling techniques, and can be useful during model tuning.
- Fares Awad Muhammad
- Mar 25, 2022