Handling Imbalanced Data: Oversampling vs. Undersampling

Introduction

Working with imbalanced data when building a classification ML model can be a challenge, as it affects your model’s ability to make accurate predictions. This is particularly true when you care about more than the overall accuracy score. Ultimately, you want a balanced dataset so that your model has enough examples of every class to learn from.

There are quite a number of solutions to this problem; this article focuses on two resampling techniques that address it: oversampling and undersampling.

What is imbalanced data?

Imbalanced data is a dataset with an uneven distribution of cases in the target class. This means there is a disparity between positive and negative cases, with one class occurring significantly more often than the other. For instance, if the target class has 90% positive instances and 10% negative instances, the dataset is said to be imbalanced.

This is a common classification problem, and the bias in the training dataset influences the machine learning algorithm. A typical solution is to gather more data; however, when that is not possible, you can employ various techniques to achieve balanced data.
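Before reaching for any resampling technique, it helps to check how skewed the target actually is. Here is a minimal sketch, assuming your labels live in a variable called y (a hypothetical name):

from collections import Counter

# Count how many samples belong to each class
print(Counter(y))
# Example output for an imbalanced target: Counter({0: 200, 1: 52})

If one class dominates the counts the way it does here, the techniques below are worth considering.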

Oversampling

This is a sampling technique that involves creating additional instances of the minority class to balance the dataset. In this case, more samples are added to the minority class to increase its representation and present balanced data for training the model. The aim is to prevent the model from being biased towards the majority class. For instance, if there are 52 positive cases and 200 negative cases, oversampling adds samples to the positive class until it reaches roughly 200, thus balancing the data.

Techniques for oversampling:

Random Oversampling: In this case, examples of the minority class in the training dataset are selected at random and duplicated until the class reaches a similar number of samples as the majority class. Here is what your typical code should look like:

from imblearn.over_sampling import RandomOverSampler

# Create an instance of RandomOverSampler
ros = RandomOverSampler(random_state=42)

# Perform random oversampling on the training set
X_train_oversampled, y_train_oversampled = ros.fit_resample(X_train, y_train)

Here, we created an instance of RandomOverSampler called ros with a specified random_state for reproducibility. We then resampled our training set using ros.fit_resample.
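If you want to confirm that the oversampling worked, you can compare the class counts before and after resampling. A quick sketch, reusing the variables above:

from collections import Counter

# Class distribution before and after random oversampling
print("Before:", Counter(y_train))
print("After:", Counter(y_train_oversampled))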

While this is a simple approach, it is important to note that duplicating minority samples exactly carries a risk of overfitting the model.

Synthetic Minority Over-sampling Technique (SMOTE): Here, artificial (synthetic) samples of the minority class are generated to increase its representation. The technique works by first randomly selecting an instance in the minority class, then identifying its k nearest neighbors (5 by default), randomly choosing one of those neighbors, drawing a line between the two points, and finally picking a random point along that line as a synthetic sample.

This sounds like a lot, but you shouldn't worry much about it, as it is easily achieved with a few lines of code in Python. Here is an example:

from imblearn.over_sampling import SMOTE

# Create an instance of SMOTE
smote = SMOTE(random_state=42)

# Perform SMOTE oversampling on the training set
X_train_oversampled, y_train_oversampled = smote.fit_resample(X_train, y_train)

Notice how similar this code is to the random oversampling example. We import SMOTE from the imblearn.over_sampling module and resample our training data to achieve balanced data.

Be careful when implementing SMOTE: if there is no clear separation between the minority and majority classes, the synthetic samples it generates may introduce noise and mislead the model.
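SMOTE also exposes a couple of knobs worth knowing about. As a hedged sketch (k_neighbors and sampling_strategy are imblearn parameters; the values here are only illustrative):

from imblearn.over_sampling import SMOTE

# Use 3 nearest neighbours instead of the default 5, and oversample the
# minority class to only half the size of the majority class
# (a float sampling_strategy only applies to binary targets)
smote = SMOTE(k_neighbors=3, sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Stopping short of a perfect 50/50 split like this can reduce the amount of synthetic data, and therefore noise, introduced into the training set.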

Undersampling

This technique is the opposite of oversampling. Here, rather than generating samples of the minority class to balance the data, instances of the majority class are removed until the two classes are of similar size. Following our previous example, rather than increasing the positive class to 200, the negative class is reduced to 52.

Unlike oversampling, which creates new data, this method works only with already existing data. The downside, however, is that some information in the dataset is lost when majority-class samples are discarded.

There are several techniques for undersampling; for this write-up, we will discuss two:

Random Undersampling: Here, a subset of instances of the majority class is randomly selected to match the number of instances in the minority class. This is what your Python code would look like:

from imblearn.under_sampling import RandomUnderSampler

# Create an instance of RandomUnderSampler
rus = RandomUnderSampler(random_state=42)

# Perform random undersampling on the training set
X_train_undersampled, y_train_undersampled = rus.fit_resample(X_train, y_train)

Again, the code looks similar to our previous examples: import the sampler, call fit_resample on the training data, and you get a balanced dataset back.

NearMiss: Unlike random undersampling, which picks samples at random, this technique selects instances from the majority class based on their proximity (closeness) to instances of the minority class. There are three versions of this technique: NearMiss-1, NearMiss-2, and NearMiss-3.

  • NearMiss-1 selects examples from the majority class that have the smallest average distance to the three closest examples from the minority class.

  • NearMiss-2 selects examples from the majority class that have the smallest average distance to the three furthest examples from the minority class.

  • NearMiss-3 selects a given number of the closest majority class examples for each example in the minority class.

Again, this isn't as complicated as it looks, especially because it can be achieved using a few lines of code in Python. Here is what the code looks like:

from imblearn.under_sampling import NearMiss

# Create an instance of NearMiss
nm = NearMiss(version=1)

# Perform NearMiss undersampling on the training set
X_train_undersampled, y_train_undersampled = nm.fit_resample(X_train, y_train)

Notice how similar this is to all our previous examples. The slight difference here is that we can specify which version of NearMiss to use (1, 2, or 3).
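Trying the other versions is just a matter of changing that argument. A brief sketch, reusing the training variables from above:

from imblearn.under_sampling import NearMiss

# NearMiss-2 ranks majority samples by their distance to the farthest minority examples
nm2 = NearMiss(version=2)
X_res_2, y_res_2 = nm2.fit_resample(X_train, y_train)

# NearMiss-3 keeps the closest majority samples for each minority sample
nm3 = NearMiss(version=3)
X_res_3, y_res_3 = nm3.fit_resample(X_train, y_train)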

Which technique to implement largely depends on your dataset; different situations call for different techniques, including techniques not listed here, so take care to choose the one that works best for you. Moreover, avoid applying these techniques to the test data or to the overall dataset, as you want to keep an unmanipulated portion of the data to evaluate your model on.
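To make that last point concrete, here is a hedged end-to-end sketch: split first, resample only the training portion, and evaluate on the untouched test set. The classifier and the X and y variable names are only illustrative.

from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Split first so the test set stays untouched
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Resample only the training portion
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train_res))

# Train on the resampled data, evaluate on the original, imbalanced test set
clf = LogisticRegression(max_iter=1000).fit(X_train_res, y_train_res)
print(clf.score(X_test, y_test))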

Conclusion

Tackling imbalanced data in a classification problem isn’t exactly rare, and it isn’t as complicated as it might seem at first glance. With the right implementation of oversampling or undersampling, you can achieve a balanced dataset for training your model.