What is Exploratory Data Analysis (EDA) ?

Introduction

If you’re getting started with Data Science & Machine Learning, you have probably come across EDA in blogs, videos, and courses. So what is EDA? Exploratory Data Analysis (EDA) is the process of building an overall understanding of the data before making any final decisions. It helps us compute statistical summaries, detect outliers, surface interesting insights, and reveal correlations between the dependent & independent variables, if any. After EDA is done and insights are drawn out, the data is used for more sophisticated analysis and modeling, such as building a Machine Learning model.

Why is EDA important in Data Science?

Real-world datasets are rarely easy to work with: the data is often dirty and needs to be cleaned, null values have to be handled appropriately, and outliers must be detected and removed if present. Data Scientists perform exploratory analysis to make sure their assumptions hold before any business decisions are taken. It also helps stakeholders ask better questions once they start to grasp the underlying patterns in the data. All of this makes EDA an important part of any Data Science or Analytics project.

Tools for EDA

Here are the common tools used for EDA:

  1. Python – Python is an object-oriented, high-level, interpreted language whose English-like syntax makes it easy for beginners to learn & use. A lot of open-source, well-documented libraries are available for Data Science & Machine Learning projects, such as Pandas & NumPy (for data manipulation), Sklearn (for modeling), and Matplotlib & Seaborn (for visualization).

  2. R – R is an open-source programming language, popular among statisticians in Data Science for statistical computing, data analysis, and visualization.

Understanding EDA with an example

The dataset I’m working with in this example is of Heart Disease patients: it records various clinical parameters for each patient and whether or not they were diagnosed with heart disease.

Initially, I loaded the data and imported the required libraries: pandas (for data manipulation), NumPy, Sklearn (for modeling), and Matplotlib (for visualization).

Note: Observations made by me are mentioned with bullet points.

  • The dataset loaded is in CSV (comma separated values) format.
  • The “shape” attribute of the data frame gives the number of rows and columns in the dataset: 303 rows and 14 columns.
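In code, loading the data and checking its shape is only a couple of lines. The file name below is an assumption, and the tiny inline frame is just so the sketch runs on its own:

```python
import pandas as pd

# With the real dataset you would load the CSV, e.g.:
# df = pd.read_csv("heart.csv")   # file name/path is an assumption
# A tiny made-up frame keeps this sketch self-contained:
df = pd.DataFrame({"age": [63, 37, 41], "target": [1, 1, 0]})

# "shape" is a (rows, columns) tuple; the real dataset gives (303, 14).
print(df.shape)
```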

Now let’s have a look at a few records in the data frame.

  • To get a closer look at the records, the “head()” function shows the first n records in a data frame (default is 5).
  • Out of the 14 columns, 13 are independent variables and 1 (target) is the dependent variable, i.e. whether a patient has heart disease or not.
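A minimal sketch of that step, using a made-up frame with a few of the dataset’s column names:

```python
import pandas as pd

# Made-up rows using column names from the heart disease dataset.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "cp": [3, 2, 1, 1, 0],
    "thalach": [150, 187, 172, 178, 163],
    "target": [1, 1, 1, 1, 0],
})

print(df.head())   # first 5 rows by default; df.head(3) for the first 3

# Separate the independent variables from the dependent one.
X = df.drop(columns=["target"])
y = df["target"]
```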

Next, I’ve plotted a bar graph to see the count of individual values in comparison with each other.

  • The number of records with target value 1 is 165, whereas target 0 has 138.
  • Target 1 represents the positive class, i.e. the patient has heart disease, and target 0 the negative class.
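The counts behind that bar graph come straight from value_counts(); the series below reconstructs the article’s numbers:

```python
import pandas as pd

# Rebuild the target column from the counts quoted above.
y = pd.Series([1] * 165 + [0] * 138, name="target")

counts = y.value_counts()
print(counts)               # 1 -> 165, 0 -> 138
# counts.plot(kind="bar") draws the bar graph (requires matplotlib).
```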

You can learn more about the structure of the dataset by using methods like “info()” to get an idea about the datatypes of the columns and “describe()” for getting a statistical summary of the numerical columns.
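For example (again with a toy frame standing in for the real data):

```python
import pandas as pd

df = pd.DataFrame({"age": [63, 37, 41], "target": [1, 1, 0]})

df.info()                  # column dtypes and non-null counts
summary = df.describe()    # count, mean, std, min, quartiles, max
print(summary)
```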

Dealing with Null values

If a dataset has null values, it becomes hard to work with for further analysis and modeling, so we need to check for them and handle them using appropriate techniques.

To demonstrate this I’m using the popular Titanic dataset (because the heart disease dataset doesn’t have any null values), which records passenger details such as Gender, Age, Pclass, Embarked, SibSp (number of siblings/spouses aboard), and Survival (whether they survived or not). A Pandas data frame has an “isnull()” method which flags whether each value is null.

  • The Age column has 177 null values and Embarked has 2 null values.
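Chaining .sum() onto isnull() gives those per-column totals. A self-contained sketch with a tiny stand-in for the Titanic frame (the real one has 891 rows):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic data; values are made up.
titanic = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() flags every cell; .sum() totals the flags per column.
null_counts = titanic.isnull().sum()
print(null_counts)   # Age: 2, Embarked: 1 in this toy frame
```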

There are different ways to handle null values:

  1. Remove the column if it doesn’t have a sufficient number of non-null values.
  2. Remove the entire row containing the null value.
  3. Fill the column with an appropriate value, such as the mean, median, or mode, based on the column’s data type.
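In Pandas, the three options map onto drop(), dropna(), and fillna(); a small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "S"],
})

dropped_col = df.drop(columns=["Age"])    # 1. drop a too-sparse column
dropped_rows = df.dropna()                # 2. drop rows with any null

# 3. fill: median for the numeric column, mode for the categorical one.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
print(filled)
```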

Here we’re going to fill the Age column instead of dropping it, because it plays an important role in determining whether a passenger survived or not. Filling the ages with just the overall average won’t help, as passengers of different classes and genders come from different age groups. So we’ll fill the Age column accordingly.

As we can see in the data, Age varies with class and gender, so the fill_age() function replaces each null value based on the class and gender of that specific entry. The Embarked column contains categorical values, so filling it with the most frequent value makes sense.
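The article’s fill_age() isn’t reproduced here, but a plausible reconstruction is a group-wise fill: replace each missing Age with the median Age of passengers of the same class and gender. The function body below is my sketch, not the author’s exact code:

```python
import numpy as np
import pandas as pd

# Toy Titanic rows; values are made up for illustration.
titanic = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["male", "male", "female", "female", "female"],
    "Age": [40.0, np.nan, 20.0, 30.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

def fill_age(df):
    """Fill missing ages with the median age of the same class & gender."""
    df = df.copy()
    df["Age"] = df.groupby(["Pclass", "Sex"])["Age"].transform(
        lambda ages: ages.fillna(ages.median())
    )
    return df

titanic = fill_age(titanic)
# Embarked is categorical, so use the most frequent value.
titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode()[0])
print(titanic.isnull().sum())   # every count is now 0
```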

After filling these 2 columns, you can see that we no longer have any null values.

Data Cleaning

Sometimes while working on a dataset, you might encounter incorrect, corrupt, or irrelevant values that need to be dealt with. For these reasons, data cleaning is an important pre-processing step for any dataset.

For example, I have a real estate dataset scraped from the web. One column describes the property, and the number of bedrooms is mentioned somewhere inside that text. That isn’t directly workable, so we want to extract only the exact number (as an int) which can be used for further calculations. To solve this, I wrote a function that takes the textual description and returns the number of bedrooms for that property.
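Such a function typically boils down to a regular expression. The sketch below is hypothetical — the real column name and wording depend on the scraped listings — but it shows the idea:

```python
import re
import pandas as pd

def extract_bedrooms(description):
    """Pull the bedroom count out of a free-text listing description.

    The pattern (digits followed by 'BHK' or 'Bedroom') is an assumption
    about how the scraped text is worded; adapt it to your data.
    """
    match = re.search(r"(\d+)\s*(?:BHK|Bedroom)", description, flags=re.I)
    return int(match.group(1)) if match else None

listings = pd.Series(["2 BHK Apartment in Andheri", "Spacious 3 bedroom flat"])
print(listings.map(extract_bedrooms).tolist())   # [2, 3]
```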

 

 


After applying the function, you can see we’ve got the exact bedroom number without any other irrelevant characters. Similarly, the price column is also in string format: it contains the Rupee symbol and the unit in which the price is quoted, such as Crores or Lacs. Here too we need to clean the values and bring the different units onto the same scale.
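A hedged sketch of such a cleaner — it assumes prices look like “₹ 1.2 Cr” or “₹ 85 Lac” and converts everything to Lacs (1 Crore = 100 Lacs); the exact formats in the scraped data may differ:

```python
import re

def clean_price(raw):
    """Strip the Rupee symbol and express the price in Lacs."""
    raw = raw.replace("\u20b9", "").strip()           # drop the Rupee symbol
    value = float(re.search(r"[\d.]+", raw).group())  # numeric part
    if re.search(r"cr", raw, flags=re.I):             # Crores -> Lacs
        value *= 100
    return value

print(clean_price("\u20b9 1.2 Cr"))   # 120.0
print(clean_price("\u20b9 85 Lac"))   # 85.0
```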
 

 

Here is how it looks after passing the Price column through the function we defined:
 

 

Data Analysis

After getting an understanding of the structure of the dataset, there are no pre-defined steps for performing the analysis or finding insights; you have to use your own skills, dig as deep as you can, and extract valuable information that was previously unknown. The way I’ve done it here is only one of many possible approaches.

As there is a column for gender, I tried to find out how the incidence of heart disease differs by gender. So I plotted a pie chart.
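The quantity behind such a pie chart is just a filtered value_counts(). A toy sketch (in the heart disease data the sex column encodes 1 = male, 0 = female; the rows below are made up):

```python
import pandas as pd

df = pd.DataFrame({"sex": [1, 1, 1, 0, 0], "target": [1, 1, 0, 1, 0]})

# Gender split among the positive (diseased) cases.
diseased = df[df["target"] == 1]["sex"].value_counts()
print(diseased)
# diseased.plot.pie(autopct="%1.1f%%") draws the chart (requires matplotlib).
```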

 

 

  • This visual shows that males are more likely to get heart disease than females.
  • Out of curiosity I looked it up on the web, and that’s indeed the case. You can read more about it here.

Next, I wanted to see how Age and Max Heart Rate relate to the target variable. A scatter plot will help us understand this better.

 

  • On the left, the yellow points represent the Max Heart Rate of individuals who have heart disease, and we can see they are shifted slightly towards the higher side, indicating that max heart rate tends to be higher among diseased patients.
  • If the scatter plot is a little difficult to read, the bar chart on the right clearly shows that the average max heart rate of patients with heart disease is greater than that of patients without it.
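The bar chart on the right reduces to a groupby mean; the sketch below uses made-up rows, so the averages differ from the real dataset’s:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 60],
    "thalach": [150, 187, 172, 178, 140, 130],   # max heart rate achieved
    "target": [1, 1, 1, 1, 0, 0],
})

# Average max heart rate per target class -- the bar chart's data.
avg_rate = df.groupby("target")["thalach"].mean()
print(avg_rate)

# For the scatter plot (requires matplotlib):
# df.plot.scatter(x="age", y="thalach",
#                 c=df["target"].map({1: "gold", 0: "navy"}))
```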

Let’s see the distribution of age, i.e. which age group is most affected by the disease.

  • From the histogram of the age distribution of people with a positive target value, we observe that the most affected age group is 40 – 65. Other health issues that are common in this age group, like high BP, Diabetes, and High Cholesterol, also contribute to this.

Next, there is a column cp (chest pain) with four distinct values, 0 to 3, where a larger value means more severe chest pain. Is this a notable symptom before someone is diagnosed with heart disease? We’ll see if that’s the case with a bar chart.

  • It’s clear from the visual that the ratio of positive to negative cases increases with the severity of the chest pain.
  • For chest pain level 0 the number of negative target values is more than double the number of positive ones, but that reverses as the chest pain level goes up. So keeping this in mind might help if someone around you is experiencing chest pain.
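The counts behind that bar chart come from a crosstab of cp against target. A toy sketch (rows made up, but arranged so the positive share rises with cp, as in the real data):

```python
import pandas as pd

df = pd.DataFrame({
    "cp": [0, 0, 0, 1, 1, 2, 2, 3],
    "target": [0, 0, 1, 0, 1, 1, 1, 1],
})

# Target counts per chest-pain level.
table = pd.crosstab(df["cp"], df["target"])
print(table)
# table.plot(kind="bar") draws the grouped bar chart (requires matplotlib).
```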

Correlation Matrix

To understand the relationships between the different attributes of the data, a correlation matrix is often used. It is a matrix of size n x n, where n is the number of features in your dataset. We then plot the correlation matrix, as I’ve done below for the heart disease data.

You can read either the lower or the upper side of the highlighted diagonal; both are the same. Two variables can have a positive correlation, where an increase in one variable comes with an increase in the other; a negative correlation, where an increase in one comes with a decrease in the other; or no correlation, where neither affects the other. The correlation value lies between -1 and 1, and reading it correctly helps during Feature Engineering for further modeling.

  • Here cp (chest pain), thalach (max heart rate achieved) & slope have a positive correlation with the target.
  • Whereas exang (exercise-induced angina) & oldpeak (ST depression induced by exercise relative to rest) have a negative correlation with the target.
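Computing the matrix itself is one call to corr(); a toy sketch with three of the dataset’s columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "thalach": [150, 187, 172, 178],
    "target": [1, 0, 0, 1],
})

# Pairwise Pearson correlations: an n x n matrix with 1.0 on the diagonal.
corr = df.corr()
print(corr.round(2))
# seaborn's sns.heatmap(corr, annot=True) renders it as the usual heatmap.
```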

By now you should have an idea of how EDA is done, from Data Cleaning to Data Pre-processing. There is more than what I’ve shown here, such as Data Collection, Feature Engineering, Modelling, Advanced Statistical Analysis, and Hyperparameter Tuning, though as I already said there is no single defined path; you have to use your creativity.

What’s Next?

After this basic analysis and visualization, once you’re well-versed with the data and the problem statement you’re working on, the next step is building a Machine Learning model that (with respect to this dataset) classifies whether a patient with the given health parameters has heart disease or not.

  • Aniket Jalasakare
  • Jul, 03 2022
