Do You Know How to Handle Imbalanced Datasets? Learn Here
What is an imbalanced dataset?
A dataset is called imbalanced when the number of observations for one class far outweighs the number of observations for another class. This is a major hurdle in problems such as disease detection and identification of fraudulent transactions. Machine learning algorithms by themselves do not account for any imbalance in the data, so the predictive models built on it may give inaccurate and biased results. Since we build models to maximize predictive accuracy, the class imbalance needs to be handled explicitly during the data processing stage.
Example:
Total observations → 5000
Patients with heart disease → 50
Patients without heart disease → 4950
Chance of heart disease / event rate → 1%
Since the event rate is under 5%, this scenario is termed a rare event. ML models built on such a skewed dataset would yield inaccurate results and would not be suitable for deployment in heart disease prediction.
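To make the numbers concrete, here is a minimal sketch of how the event rate above could be computed; the label vector y is a hypothetical stand-in for real patient data, with 1 marking patients who have heart disease.

import numpy as np

# Hypothetical label vector: 1 = heart disease, 0 = no heart disease
y = np.array([1] * 50 + [0] * 4950)

n_total = len(y)                      # 5000 observations
n_minority = int(np.sum(y == 1))      # 50 positive cases
event_rate = n_minority / n_total     # 50 / 5000 = 0.01

print(f"Event rate: {event_rate:.1%}")  # -> Event rate: 1.0%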
Techniques to handle imbalanced data:
1. Random Under-sampling
In this method, majority class records are randomly removed from the dataset to reduce the number of samples. This is generally advisable only for very large datasets, where enough data remains for training after records are dropped.
For getting 1:1 class distribution:
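A minimal sketch, assuming the third-party imbalanced-learn package and a synthetic dataset generated with scikit-learn as a stand-in for real data; with sampling_strategy="auto", majority records are removed until both classes are the same size.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset: roughly 99% majority, 1% minority
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# sampling_strategy="auto" removes majority samples until both classes are equal (1:1)
rus = RandomUnderSampler(sampling_strategy="auto", random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))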
For getting a distribution as per a pre-defined ratio:
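And a sketch for a pre-defined ratio under the same assumptions; a float sampling_strategy is interpreted as the desired minority-to-majority ratio after resampling, so 0.5 gives roughly a 1:2 class distribution.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# A float sampling_strategy is the desired minority/majority ratio after resampling
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))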
Advantages:
a) Since the number of records decreases, data processing and model building become faster.
Disadvantages:
a) Potential information loss as records are outright removed from the initial dataset prior to model building.
b) The smaller remaining sample may no longer be representative of the original data, so the resulting model can still be biased or inaccurate.
2. Random Over-sampling
In this method, minority class data is randomly replicated to create more records in the dataset. These are essentially duplicates.
For getting 1:1 class distribution:
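A minimal sketch, again assuming imbalanced-learn and an illustrative synthetic dataset; RandomOverSampler duplicates minority records at random until the classes are balanced 1:1.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# Duplicates minority records at random until both classes are the same size (1:1)
ros = RandomOverSampler(sampling_strategy="auto", random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))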
For getting a distribution as per a pre-defined ratio:
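And for a pre-defined ratio under the same assumptions, where sampling_strategy=0.5 over-samples the minority class until it is half the size of the majority class.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# Replicates minority records until minority/majority = 0.5 (a 1:2 distribution)
ros = RandomOverSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))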
Advantages:
a) There is no information loss as compared to under-sampling.
b) Often yields better models than under-sampling, since all of the original information is retained.
Disadvantages:
a) There is an increased chance of overfitting, since over-sampling creates exact duplicates in the dataset; the model can end up memorizing these repeated records and become biased.
3. SMOTE (Synthetic Minority Over-sampling technique)
SMOTE is an improved over-sampling technique that aims to minimize the overfitting caused by simple random over-sampling. From the minority class, a subset of the data is used to synthetically create completely new but similar instances. These synthetic records are then added to the initial dataset, which is then used to train the ML models.
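A minimal SMOTE sketch, again assuming imbalanced-learn and an illustrative synthetic dataset; k_neighbors controls how many nearby minority points are considered when interpolating each new synthetic record.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# Each synthetic sample is interpolated between a minority point and one of its k nearest minority neighbours
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))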
Advantages:
a) Minimizes the potential occurrence of overfitting as a result of random over-sampling
b) Data loss is prevented
Disadvantages:
a) For high dimensional data (number of features > number of observations), SMOTE is not a preferred method.
b) SMOTE doesn’t consider neighbouring examples from the majority class when creating synthetic samples, so class overlap can increase, introducing additional noise into the data.
4. MSMOTE (Modified Synthetic Minority Over-sampling technique)
As discussed earlier, SMOTE doesn’t take neighbouring majority-class examples into account during sample generation, which can result in noisy data. MSMOTE modifies SMOTE to take them into consideration. In this technique, the minority class samples are first divided into three non-overlapping groups:
a) Security/ Safe samples – these enhance the performance and prediction capabilities of a classifier
b) Latent noise samples – these data samples end up reducing the performance of the model
c) Border samples – points that cannot be placed in either of the above two groups
This division into 3 different sub classes is done by calculating the distances between minority data samples and the training data samples.
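MSMOTE itself is not part of the common imbalanced-learn package, so the sketch below only illustrates this grouping step under a usual reading of the rule: for each minority sample, look at its k nearest neighbours in the training data; if all neighbours are minority points it is a safe sample, if none are it is a latent noise sample, and anything in between is a border sample. The synthetic dataset and the value of k are assumptions for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)
minority_idx = np.where(y == 1)[0]
k = 5

# Find the k nearest training samples to each minority point (excluding the point itself)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, neighbour_idx = nn.kneighbors(X[minority_idx])
neighbour_labels = y[neighbour_idx[:, 1:]]          # drop the self-neighbour in column 0

minority_neighbour_count = (neighbour_labels == 1).sum(axis=1)
groups = np.where(minority_neighbour_count == k, "safe",
         np.where(minority_neighbour_count == 0, "noise", "border"))

print({g: int((groups == g).sum()) for g in ("safe", "border", "noise")})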
- Vivek Banerjee
- Mar 27, 2022