How to Handle Imbalanced Datasets in Machine Learning Projects | by Jiayan Yin | Oct, 2024

Techniques to handle imbalanced datasets, examples, and Python snippets

Towards Data Science
Photo by Nick Fewings on Unsplash

Imagine that you’ve trained a predictive model with an accuracy score as high as 0.9. Evaluation metrics such as precision, recall, and F1-score also look promising. But your experience and intuition tell you that something isn’t right, so you investigate further and find this:

Image_1 — Screenshot by the author

The model’s seemingly strong performance is driven by the majority class 0 of the target variable. Because the classes are so imbalanced, the model excels at predicting class 0, while its performance on the minority class 1 is far from satisfactory. But since class 1 makes up only a small portion of the target variable, its poor results barely move the overall scores of these evaluation metrics, which creates the illusion that the model is strong.
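To make this effect concrete, here is a minimal sketch in plain Python. The labels are synthetic and the 90/10 class split, the predictions, and the helper functions `accuracy` and `recall` are all assumptions made for illustration; they are not the data or code behind the screenshot above.

```python
# Synthetic illustration (made-up labels, not the article's dataset):
# 90 samples of majority class 0 and 10 samples of minority class 1.
y_true = [0] * 90 + [1] * 10

# Hypothetical predictions: the model gets 88 of the 90 majority samples
# right, but recovers only 2 of the 10 minority samples.
y_pred = [0] * 88 + [1] * 2 + [0] * 8 + [1] * 2


def accuracy(y_true, y_pred):
    """Fraction of all samples that were predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the samples belonging to `label` that the model found."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    total = sum(t == label for t in y_true)
    return hits / total


print(f"Overall accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"Recall, class 0:  {recall(y_true, y_pred, 0):.2f}")
print(f"Recall, class 1:  {recall(y_true, y_pred, 1):.2f}")
```

In this toy setup the overall accuracy is 0.9 even though the model misses 8 of the 10 minority samples, which is why per-class metrics are worth checking whenever the target variable is skewed.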

This is not a rare case. On the contrary, data scientists frequently come across imbalanced datasets in real-world projects. An imbalanced dataset refers to a dataset where the classes or categories are not