Do You Know How to Handle Imbalanced Datasets? Learn Here
What is an imbalanced dataset?
A dataset is called imbalanced when the number of observations for one class far outweighs the number of observations for another class. This is a major hurdle in problems such as disease detection and identification of fraudulent transactions. Machine learning algorithms by themselves do not account for any imbalance in the data, so the predictive models built on it may give inaccurate and biased results. Since we build models to maximize predictive accuracy, the class imbalance needs to be handled explicitly during the data processing stage.
Example:
Total observations → 5000
Patients with heart disease → 50
Patients without heart disease → 4950
Chance of heart disease / event rate → 1%
Since the event rate is under 5%, this scenario is termed a rare event. ML models built on such a skewed dataset would yield inaccurate results and would not be suitable for deployment in heart disease prediction.
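To make the numbers concrete, here is a minimal sketch of how the event rate above could be computed; the label vector y is a hypothetical stand-in for real patient data, with 1 marking patients who have heart disease.

import numpy as np

# Hypothetical label vector: 1 = heart disease, 0 = no heart disease
y = np.array([1] * 50 + [0] * 4950)

n_total = len(y)                      # 5000 observations
n_minority = int(np.sum(y == 1))      # 50 positive cases
event_rate = n_minority / n_total     # 50 / 5000 = 0.01

print(f"Event rate: {event_rate:.1%}")  # -> Event rate: 1.0%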
Techniques to handle imbalanced data:
1. Random Under-sampling
In this method, majority class records are randomly removed from the dataset to reduce the number of samples. This is generally advisable only for very large datasets, where enough data remains for training after records are dropped.
For getting 1:1 class distribution:
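A minimal sketch, assuming the third-party imbalanced-learn package and a synthetic dataset generated with scikit-learn as a stand-in for real data; with sampling_strategy="auto", majority records are removed until both classes are the same size.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset: roughly 99% majority, 1% minority
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# sampling_strategy="auto" removes majority samples until both classes are equal (1:1)
rus = RandomUnderSampler(sampling_strategy="auto", random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))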
For getting a distribution as per a pre-defined ratio:
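And a sketch for a pre-defined ratio under the same assumptions; a float sampling_strategy is interpreted as the desired minority-to-majority ratio after resampling, so 0.5 gives roughly a 1:2 class distribution.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# A float sampling_strategy is the desired minority/majority ratio after resampling
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))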
Advantages:
a) Since the number of records decreases, data processing and model building become faster.
Disadvantages:
a) Potential information loss as records are outright removed from the initial dataset prior to model building.
b) The smaller remaining sample may no longer be representative of the original data, so the resulting model can still be biased or inaccurate.
2. Random Over-sampling
In this method, minority class data is randomly replicated to create more records in the dataset. These are essentially duplicates.
For getting 1:1 class distribution:
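A minimal sketch, again assuming imbalanced-learn and an illustrative synthetic dataset; RandomOverSampler duplicates minority records at random until the classes are balanced 1:1.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# Duplicates minority records at random until both classes are the same size (1:1)
ros = RandomOverSampler(sampling_strategy="auto", random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))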
For getting a distribution as per a pre-defined ratio:
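And for a pre-defined ratio under the same assumptions, where sampling_strategy=0.5 over-samples the minority class until it is half the size of the majority class.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# Replicates minority records until minority/majority = 0.5 (a 1:2 distribution)
ros = RandomOverSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))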
Advantages:
a) There is no information loss as compared to under-sampling.
b) Often yields better models than under-sampling, since all of the original information is retained.
Disadvantages:
a) There is an increased chance of overfitting, since over-sampling creates exact duplicates in the dataset; the model can end up memorizing these repeated records and become biased.
3. SMOTE (Synthetic Minority Over-sampling technique)
SMOTE is an improved over-sampling technique that aims to minimize the overfitting caused by simple random over-sampling. From the minority class, a subset of the data is used to synthetically create completely new but similar instances. These synthetic records are then added to the initial dataset, which is then used to train the ML models.
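A minimal SMOTE sketch, again assuming imbalanced-learn and an illustrative synthetic dataset; k_neighbors controls how many nearby minority points are considered when interpolating each new synthetic record.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# Each synthetic sample is interpolated between a minority point and one of its k nearest minority neighbours
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print("Before:", Counter(y), "After:", Counter(y_res))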
Advantages:
a) Minimizes the potential occurrence of overfitting as a result of random over-sampling
b) Data loss is prevented
Disadvantages:
a) For high dimensional data (number of features > number of observations), SMOTE is not a preferred method.
b) SMOTE doesn’t consider neighbouring examples from the majority class when creating synthetic samples, so class overlap can increase, introducing additional noise into the data.
4. MSMOTE (Modified Synthetic Minority Over-sampling technique)
As discussed earlier, SMOTE doesn’t take neighbouring majority-class examples into account during sample generation, which can result in noisy data. MSMOTE modifies SMOTE to take them into consideration. In this technique, the minority class samples are first divided into three non-overlapping groups:
a) Security/ Safe samples – these enhance the performance and prediction capabilities of a classifier
b) Latent noise samples – these data samples end up reducing the performance of the model
c) Border samples – points that cannot be placed in either of the above two groups
This division into 3 different sub classes is done by calculating the distances between minority data samples and the training data samples.
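MSMOTE itself is not part of the common imbalanced-learn package, so the sketch below only illustrates this grouping step under a usual reading of the rule: for each minority sample, look at its k nearest neighbours in the training data; if all neighbours are minority points it is a safe sample, if none are it is a latent noise sample, and anything in between is a border sample. The synthetic dataset and the value of k are assumptions for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)
minority_idx = np.where(y == 1)[0]
k = 5

# Find the k nearest training samples to each minority point (excluding the point itself)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, neighbour_idx = nn.kneighbors(X[minority_idx])
neighbour_labels = y[neighbour_idx[:, 1:]]          # drop the self-neighbour in column 0

minority_neighbour_count = (neighbour_labels == 1).sum(axis=1)
groups = np.where(minority_neighbour_count == k, "safe",
         np.where(minority_neighbour_count == 0, "noise", "border"))

print({g: int((groups == g).sum()) for g in ("safe", "border", "noise")})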
- Vivek Banerjee
- Mar 27, 2022