What Is Class Imbalance In Machine Learning & How To Fix It
Avoid the frustration of dealing with imbalanced real-world datasets and learn to fix them with ease
What Is Class Imbalance?
Real-world datasets are messy (unlike the Scikit-Learn datasets).
Class imbalance arises when the distribution of examples across different classes is not uniform.
In other words, some classes have a lot more samples than others.
For example, think about a dataset where the task is to detect a rare lung disease on Chest X-rays (Binary Classification). Out of 10,000 patients, only 50 might have the disease, while 9,950 do not.
The same might apply to a dataset used for a Regression problem of predicting house prices in a city. Most houses are priced between $100,000 and $500,000, but there are a few luxury mansions priced at over $10 million.
Such datasets can skew model training towards the majority class, leaving the model unable to reliably detect the minority class.
Choosing The Right Metrics For Model Training & Evaluation
Why Not To Choose Accuracy?
Accuracy is a common metric that is used for ML model training & evaluation.
However, on imbalanced datasets, high accuracy can be a misleading metric because a model can achieve a high accuracy score by predicting the majority class most of the time.
For example, imagine that we have a dataset that contains blood films of 1,000 patients who are tested for Malaria.
Out of 1,000 people tested, only 50 have malaria while 950 do not.Â
Now, let’s say that our model just predicts ‘No Malaria’ for every person without truly learning the underlying patterns in the dataset.
The confusion matrix for the above case is shown below:
                                  Actual
                          Malaria    No Malaria
Predicted   Malaria          0            0
            No Malaria      50          950
From the above:
True Positives (TP) = 0
True Negatives (TN) = 950
False Positives (FP) = 0
False Negatives (FN) = 50
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 950) / 1,000 = 95%
The model achieves 95% accuracy, which sounds impressive, but the model is completely useless because it fails to detect a single malaria case.
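As a quick illustration, here is a minimal sketch (assuming scikit-learn and NumPy are available; the data is the synthetic malaria example above) of a "model" that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# 1,000 patients: 50 with malaria (1), 950 without (0)
y_true = np.array([1] * 50 + [0] * 950)

# A "model" that always predicts 'No Malaria'
y_pred = np.zeros_like(y_true)

print(confusion_matrix(y_true, y_pred))  # rows = actual, cols = predicted: [[950, 0], [50, 0]]
print(accuracy_score(y_true, y_pred))    # 0.95
```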
Better Metrics For Imbalanced Datasets
Precision, Recall & F1-Score are the better metrics to use in the case of class imbalance in our dataset.
Let’s talk about them in detail.
Precision
Precision (also called the Positive Predictive Value, or PPV) measures the number of correctly predicted positive observations out of all the predicted positives.
In other words, it asks:
Of all the cases where we predicted ‘Malaria’, how many did we get right?
Precision = TP / (TP + FP) = 0 / (0 + 0)
Using the above confusion matrix, Precision is undefined; for practical purposes, it is treated as 0.
Recall
Recall (also called Sensitivity or True Positive Rate) is the number of correctly predicted positive observations out of the actual positives.
In other words, it asks:
Of all the cases that truly have ‘Malaria’, how many did we detect?
Recall = TP / (TP + FN) = 0 / (0 + 50) = 0
The Recall is 0%, which tells us that our model did not detect any of the actual malaria cases.
F1-Score
The F1-Score is the harmonic mean of Precision and Recall.
It is a way to combine Precision and Recall into a single metric that captures them both.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
For our model, the F1-Score is undefined because both Precision and Recall are 0.
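These metrics expose the problem immediately. A minimal sketch using scikit-learn (reusing the synthetic malaria arrays; zero_division=0 handles the undefined cases):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1] * 50 + [0] * 950)   # 50 malaria, 950 healthy
y_pred = np.zeros_like(y_true)            # always predicts 'No Malaria'

print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 (undefined, reported as 0)
print(recall_score(y_true, y_pred))                      # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```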
Two important curves that are used to evaluate ML model performance are:
ROC curve
Precision-Recall curve (Better for imbalanced datasets)
Let’s talk about them in detail.
Receiver Operating Characteristic (ROC) Curve
Classification models typically output a probability score; a sample is classified as belonging to the 'Positive' label when its predicted probability exceeds a chosen threshold.
A ROC curve is a plot of the True Positive Rate (TPR), or Recall, against the False Positive Rate (FPR), or false-alarm rate, at various threshold settings.
The curve starts at the point (0,0) and ends at the point (1,1).
The diagonal line from point (0,0) to point (1,1) represents the ROC curve of a random classifier.
The closer the ROC curve is to the top-left corner, the better the classifier.

Area Under The ROC Curve (AUC-ROC)
The area under the ROC curve (AUC-ROC) provides a measure of the model’s ability to distinguish between the positive and negative classes.Â
A perfect classifier will have an AUC of 1.
A completely random classifier will have an AUC of 0.5.
Steepness of the ROC Curve
The curve's steepness also indicates performance:
A steep initial rise means that the model achieves good recall at a low false positive rate.
A gradual increase indicates more false positives.
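A minimal sketch of plotting the ROC curve and computing AUC-ROC with scikit-learn and matplotlib (the synthetic dataset and logistic regression model are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic imbalanced dataset: ~5% positives
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC-ROC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="random classifier")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate (Recall)")
plt.legend(); plt.show()
```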
Why Not To Use The ROC Curve For Imbalanced Datasets?
Note that the ROC curve focuses only on the positive class and says little about how well a model performs on the negative class.
Therefore, for imbalanced datasets, the Precision-Recall (PR) curve is often considered better than the ROC curve.
Precision-Recall Curve
This curve is a plot of the Precision (Positive Predictive Value, or PPV) against the Recall (Sensitivity, or True Positive Rate) for different threshold values.
The top-right corner of the plot where both Precision and Recall are 1 shows a perfect classifier.
A random classifier will have a precision equal to the proportion of positive samples.
The higher the area under the Precision-Recall curve (AUC-PR), the better the model.
A perfect classifier has an AUC-PR of 1, while a random classifier has an AUC-PR equal to the proportion of positive samples.
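Continuing the ROC sketch above (and reusing its y_test and scores), the PR curve and AUC-PR, reported by scikit-learn as average precision, can be obtained as follows:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

# y_test and scores come from the ROC sketch above
precision, recall, thresholds = precision_recall_curve(y_test, scores)
print("AUC-PR (average precision):", average_precision_score(y_test, scores))

plt.plot(recall, precision, label="model")
plt.hlines(y_test.mean(), 0, 1, linestyles="--", label="random classifier")  # baseline = positive rate
plt.xlabel("Recall"); plt.ylabel("Precision")
plt.legend(); plt.show()
```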

Handling Class Imbalanced Datasets
Class Imbalance can be addressed using:
Data level methods: To change the distribution of a given imbalanced dataset
Algorithm level methods: To make learning by your ML model more resilient to class imbalance
Let’s discuss these in detail.
Data Level Methods
1. Random Resampling
In this method, samples from the majority class are randomly removed (under-sampling), and samples of the minority class are randomly replicated (oversampling).
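A minimal sketch using the imbalanced-learn library (an assumption, along with the synthetic dataset; X is a feature matrix and y a label vector):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset: ~5% minority class
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Over-sampling: randomly replicate minority samples
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)

# Under-sampling: randomly drop majority samples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
```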
2. SMOTE Oversampling
SMOTE or Synthetic Minority Over-sampling Technique is a technique used to increase the samples in the minority class in the dataset.
It works as follows:
For each sample in the minority class, its k-nearest neighbours (within the minority class) are found and one of them is randomly chosen.
The difference between the feature vector of this sample and its chosen neighbor is calculated.
The difference is multiplied by a random number between 0 and 1.
The result is added to the feature vector of the sample. This generates a new point in the feature space.
The process is repeated until the class imbalance is resolved.
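A minimal sketch using imbalanced-learn's SMOTE implementation (the dataset and k_neighbors value are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Interpolate new minority samples between each minority point and one of its k nearest minority neighbours
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
```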
3. ADA-SYN Oversampling
ADA-SYN or Adaptive Synthetic Sampling is an extension of SMOTE.
It generates new samples adaptively based on the model’s difficulty in learning.
In regions where the minority class is surrounded by the majority class and is harder to learn, more synthetic data is generated.
This is how ADA-SYN Oversampling works:
For each sample in the minority class, its k-nearest neighbours are found.
The class distribution of these neighbours is calculated to determine how many belong to the majority class.
This ratio is used to compute a weight for each sample in the minority class. The more majority-class neighbours a sample has, the higher its weight.
For each minority class sample, synthetic samples are generated proportional to their weight (using steps from SMOTE).
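A minimal sketch using imbalanced-learn's ADASYN implementation (the dataset and n_neighbors value are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# More synthetic samples are generated for minority points surrounded by majority-class neighbours
X_res, y_res = ADASYN(n_neighbors=5, random_state=42).fit_resample(X, y)
```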
4. Tomek Links Undersampling
This is an under-sampling technique where samples from the majority and minority classes that are the nearest neighbours to each other are selected. These pairs are called Tomek Links.
Once a Tomek Link is found, the majority class sample from the link is removed.
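A minimal sketch using imbalanced-learn's TomekLinks sampler (the synthetic dataset is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Remove the majority-class member of each Tomek link
X_res, y_res = TomekLinks().fit_resample(X, y)
```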
5. Edited Nearest Neighbors Rule Undersampling
For each sample in the dataset, its k-nearest neighbours are identified.
If the sample is from the majority class and most of its nearest neighbors belong to the minority class, this sample is removed.
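A minimal sketch using imbalanced-learn's EditedNearestNeighbours sampler (the dataset and n_neighbors value are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Remove majority samples whose nearest neighbours disagree with their label
X_res, y_res = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
```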
6. Neighbour Cleaning Rule Undersampling
This is an under-sampling technique that combines the Edited Nearest Neighbors Rule and Tomek Links.
The above over-sampling and under-sampling methods can also be combined in different ways to find what works best for your use case.
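For instance, imbalanced-learn ships ready-made combinations of an over-sampler with a cleaning under-sampler; a minimal sketch (the dataset is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# SMOTE over-sampling followed by Tomek Links cleaning
X_st, y_st = SMOTETomek(random_state=42).fit_resample(X, y)

# SMOTE over-sampling followed by Edited Nearest Neighbours cleaning
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)
```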
Algorithm Level Methods
1. Cost-sensitive Learning
Rather than having a single cost for all classes, different misclassification costs are assigned to different classes.Â
For example, misclassifying a minority class sample (Cancer on a Chest X-ray) has a higher cost than misclassifying a majority class sample (Normal Chest X-ray).
This makes the learning algorithm more cautious about misclassifying the minority class.
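Many scikit-learn estimators expose this through the class_weight parameter; a minimal sketch (the dataset and the 10x minority cost are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Misclassifying the minority class (1) costs 10x more than the majority class (0)
model = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```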
2. Class Balanced Loss
The idea behind Class-balanced loss is to assign different weights to different classes.Â
This weight is inversely proportional to the number of samples in that class.
This means that the minority class gets a higher weight, and the majority class gets a lower weight.
When computing the loss, each sample’s contribution to the loss is scaled by its class weight.
This means that misclassifying a minority class sample will result in a higher penalty as compared to misclassifying a majority class instance.
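A minimal sketch of computing frequency-based class weights and applying them as per-sample weights (scikit-learn assumed; the dataset and model are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Weights inversely proportional to class frequency
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))   # the minority class gets the larger weight

# Each sample's loss contribution is scaled by its class weight
sample_weight = weights[y]
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_weight)
```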
3. Focal Loss
The intuition behind Focal Loss is to focus on learning the samples that the model gets wrong rather than the ones that it can confidently predict.
Focal Loss adds a modulating factor (1 − pₜ)^γ to the standard cross-entropy criterion, where pₜ is the model’s predicted probability for the true class.
This discounts the loss contribution from easy-to-learn samples.
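A minimal NumPy sketch of binary focal loss (γ = 2 is a common default; p is the predicted probability of the positive class, and the example values are illustrative):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: the factor (1 - p_t)^gamma scales the cross-entropy term."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)            # probability assigned to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)  # easy (high p_t) samples are heavily discounted

# An easy sample (confidently correct) vs a hard sample (confidently wrong)
print(focal_loss(np.array([0.95, 0.05]), np.array([1, 1])))
```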
4. Ensembling Techniques
The idea behind these techniques is that the combination of multiple models can achieve better performance than any single model can.
In a Balanced Random Forest, each Decision tree is trained on a balanced subset of the data created by under-sampling the majority class. The final prediction is made by aggregating the predictions of all trees.
In the EasyEnsemble method, multiple balanced subsets are created from the original dataset by randomly under-sampling the majority class. An AdaBoost classifier is then trained on each balanced subset and the final prediction is made by aggregating the predictions of each AdaBoost classifier.
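Both are available in imbalanced-learn; a minimal sketch (the dataset and estimator counts are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Each tree is trained on a balanced bootstrap sample (majority class under-sampled)
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# AdaBoost classifiers trained on multiple balanced under-sampled subsets
easy = EasyEnsembleClassifier(n_estimators=10, random_state=42).fit(X, y)
```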
Other popular Ensembling techniques are SMOTEBoost & RUSBoost.
Check out Anna Vasilyeva’s article that discusses them in detail here.