Techniques to handle Class Imbalance in data

There are several techniques that can be applied when class imbalance is visible in a given data set. Understanding what class imbalance is, and when a data set should be treated as imbalanced, is the key question; beyond that, it is often quite interesting to apply oversampling and undersampling techniques to a given data set.

Class-imbalance problem

There is no specific formula that tells you whether a problem suffers from class imbalance or not. In some situations a 98:2 distribution of the target variable in a binary classification problem is considered class imbalance, while for some data sets oversampling may be required even for an 80:20 scenario.

My suggestion is not to adjust the samples unless you have no other option. It is up to the data scientist to understand which of the techniques below is applicable to the given problem.

The most common kinds of data sets that exhibit class imbalance are:

  1. Fraud Detection
  2. Product Categorization
  3. Clinical Data prediction
  4. Fault detection

There are three common approaches to handling imbalanced data:

  1. Modifying the Loss Function / Assigning class weights
  2. Re-sampling the Data
  3. Ensemble methods

Asymmetric Loss Function

The idea behind this technique is to give more weight to the minority class and, conversely, less weight to the majority class. This way the model gets more visibility into the minority class along with the majority class.

This can be implemented using the scikit-learn package in Python. Using the class_weight hyperparameter available in most classifiers, we can adjust the weight given to each class or level.

(The accompanying figure shows how a logistic regression fit separates the majority and minority classes: the orange dots are the majority class and the green ones the minority class.)
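
As a minimal sketch, assuming scikit-learn is available (the toy data set generated below is purely illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Toy imbalanced data set: roughly 95% majority class, 5% minority class
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

    # 'balanced' weights classes inversely proportional to their frequencies;
    # an explicit dict such as {0: 1, 1: 10} can be passed instead.
    clf = LogisticRegression(class_weight="balanced")
    clf.fit(X, y)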

Re-sampling Techniques:

These techniques create resampled data sets by following different approaches that either increase or decrease the total number of samples in the data set.

Random Under-sampling:

This is an under-sampling technique that randomly removes samples from the majority class. Imagine a data set of n rows with p attributes, of which m rows belong to the majority class and n - m to the minority class:

Data set = n x p (n rows, p attributes)

Number of rows in the majority class = m

Number of rows in the minority class = n - m

Random under-sampling removes samples from the majority class at random. A ratio can be defined to tell the under-sampler how many samples to remove. The disadvantage of this method is that there is always a good chance of losing useful information and of retaining outliers in the sample; if you are lucky, it may still end up producing a good sample.
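
A minimal sketch using the imbalanced-learn (imblearn) package; this is an assumption on my part, since the post's own code lives in the linked repository:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RandomUnderSampler

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # sampling_strategy=0.5 asks for one minority sample per two majority samples
    rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
    X_res, y_res = rus.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))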

NearMiss-1:

The NearMiss technique is based mainly on the K-Nearest Neighbors approach. The algorithm computes, for each majority-class point, the mean distance to its nearest minority-class points, ranks the points by this distance, and retains the majority points whose mean distance to the minority class is lowest. Based on the ratio specified, it removes the majority-class points that are farthest from the minority class. NearMiss-1 therefore tends to retain the data close to the decision boundary, and it works well on more scattered data. K, the size of the neighborhood, is our hyperparameter.
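
A minimal sketch, again assuming imbalanced-learn; n_neighbors is the K mentioned above:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # Version 1 keeps majority points with the smallest mean distance to the minority class
    nm1 = NearMiss(version=1, n_neighbors=3)
    X_res, y_res = nm1.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))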

NearMiss-2:

NearMiss-2 works similarly to NearMiss-1, except that instead of looking at the K nearest minority points, it looks at the K farthest points and ranks the majority points in farthest-first order. Based on the sampling ratio provided, it removes the first n points from the majority class. This is more helpful for removing outliers, and the resampled data is more concentrated around the center.

NearMiss-3:

The NearMiss-3 technique picks the K nearest neighbors of each point in the minority class and retains the majority-class points that lie in those neighborhoods. Here we cannot really choose the ratio; even if we specify one it will not help much, because the neighborhoods define the ratio. With this technique the resulting data has more overlap between the classes, so it is not of much help if we are using a logistic regression model.
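
In imbalanced-learn the other two variants only differ in the version argument; a brief sketch under the same assumptions:

    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # version=2 uses the K farthest minority points; version=3 keeps the
    # majority points that fall in the neighborhood of each minority point
    X_nm2, y_nm2 = NearMiss(version=2, n_neighbors=3).fit_resample(X, y)
    X_nm3, y_nm3 = NearMiss(version=3, n_neighbors=3).fit_resample(X, y)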

Condensed Nearest Neighbor (CNN):

CNN (Condensed Nearest Neighbor) tries to remove points in the majority class that lie close to other majority-class points, i.e. redundant points far from the decision boundary. The algorithm selects a starting point at random, scans the nearest neighbors of the points retained so far, and eliminates the redundant ones. This process is repeated until the sampling ratio is reached or a minimal balance between the classes is achieved.

Note:
Because the process starts from a randomly selected point, the whole procedure is random; for this reason the algorithm has high variance.

On top of that, this method takes the longest to run of all the resampling algorithms.
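
A minimal sketch, assuming imbalanced-learn's CondensedNearestNeighbour:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import CondensedNearestNeighbour

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # random_state matters because the algorithm starts from a randomly chosen point
    cnn = CondensedNearestNeighbour(random_state=42)
    X_res, y_res = cnn.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))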

Edited Nearest Neighbor (ENN):

The ENN algorithm removes samples whose class label differs from the label of the majority of their K nearest neighbors.

When this process is repeated until no more points can be removed from the sample set, it is called Repeated ENN. This method is one of the most commonly used.
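
A minimal sketch of both the single-pass and repeated variants, assuming imbalanced-learn:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import (EditedNearestNeighbours,
                                         RepeatedEditedNearestNeighbours)

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    X_enn, y_enn = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
    X_renn, y_renn = RepeatedEditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
    print(Counter(y), "->", Counter(y_enn), "->", Counter(y_renn))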

Tomek Link Removal:

A Tomek link is a pair of points in the sample that are each other's nearest neighbors but carry different class labels. This technique is mainly used to remove noisy and borderline examples from the sample in order to get a better decision boundary.

A variant of this removes only the majority-class point of each Tomek link.
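
A minimal sketch, assuming imbalanced-learn's TomekLinks:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import TomekLinks

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # The default strategy removes only the majority-class point of each link;
    # sampling_strategy='all' would drop both points of every Tomek link.
    tl = TomekLinks()
    X_res, y_res = tl.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))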

Over-sampling:

The techniques we have seen so far are under-sampling methods. Under-sampling the majority class is one way to address class imbalance: we remove majority-class samples, assuming that the samples retained after removal still provide enough information to the model trained on them.

Rather than removing samples, we can go the other way and increase the number of minority-class samples; the over-sampling techniques for doing so are described here.

Random Over-sampling:

Random over-sampling generates new samples by duplicating existing minority-class samples, optionally adding a little jitter to them. This can cause data points to overlap, is not always helpful, and may lead to overfitting.
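
A minimal sketch, assuming imbalanced-learn's RandomOverSampler:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # By default the new samples are exact duplicates; recent versions of
    # imbalanced-learn also accept a shrinkage parameter that adds a little jitter.
    ros = RandomOverSampler(random_state=42)
    X_res, y_res = ros.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))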

Synthetic Minority Oversampling Technique (SMOTE):

SMOTE randomly selects a sample from the minority class, finds one of its nearest minority-class neighbors, draws an imaginary line between the two points, and creates a new sample at a point on that line. This process is repeated until the desired class ratio is reached.
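
A minimal sketch, assuming imbalanced-learn's SMOTE:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # k_neighbors controls how many minority neighbors are candidates for interpolation
    smote = SMOTE(k_neighbors=5, random_state=42)
    X_res, y_res = smote.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))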

Combination of over-sampling and under-sampling:

This is especially useful for data sets with a very large imbalance, and it also helps deal with untidy and noisy data.

SMOTE + Tomek Link Removal:

This technique combines SMOTE and Tomek link removal: it over-samples the minority class with SMOTE and then under-samples (cleans) the majority class by removing Tomek links.
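
A minimal sketch, assuming imbalanced-learn's SMOTETomek:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.combine import SMOTETomek

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # Over-samples with SMOTE, then cleans the result by removing Tomek links
    smt = SMOTETomek(random_state=42)
    X_res, y_res = smt.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))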

SMOTE + ENN:

Here we combine SMOTE and ENN. This is very useful when you are fitting linear classifiers to the resampled data.
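
A minimal sketch, assuming imbalanced-learn's SMOTEENN:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.combine import SMOTEENN

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # Over-samples with SMOTE, then cleans the result with Edited Nearest Neighbours
    sme = SMOTEENN(random_state=42)
    X_res, y_res = sme.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))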

Ensemble Sampling:

Ensemble sampling models internally use a bootstrapping technique.

Here we take the data, draw n random balanced samples from it, and store them (for example in a NumPy array). We then train an AdaBoost classifier on each of the n samples and combine these classifiers additively.

This tends to perform very well on large data sets.
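
As a hedged sketch, imbalanced-learn's EasyEnsembleClassifier implements this idea (balanced bootstrap samples, each fitted with an AdaBoost classifier); the linked repository may use its own implementation:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from imblearn.ensemble import EasyEnsembleClassifier

    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # n_estimators balanced bootstrap samples, one AdaBoost classifier per sample
    eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
    eec.fit(X_train, y_train)
    print(eec.score(X_test, y_test))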

These are the most common and most widely used techniques for dealing with the class-imbalance problem.

There are various other techniques (Borderline-SMOTE, SMOTEBoost, AdaC, kernel-based methods, the jackknife, and so on) that you can try implementing. But there is no perfect technique that works well for all kinds of data, so try several and see which one works best for yours.

Reference for code:

https://github.com/SaiSubrahmanyamJanapati/Classimbalance_Techniques

Authors:
R.V Suresh
J. Sai Subrahmanyam


