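Multi-class Imbalanced Classification Solutions: An Overview [Ensemble Learning, Data Resampling, Deep Learning (Meta-learning), Anomaly Detection]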

u013250861 · Published 2023-04-04 17:42:22

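Class imbalance (also known as the long-tail problem) refers to classification problems in which the classes are unequal in representation quality or number of samples.

Class imbalance is widespread in practice, for example in rare-pattern recognition tasks such as financial fraud detection, intrusion detection, and computer-aided medical diagnosis.

Class imbalance often degrades the predictive performance of conventional machine learning algorithms. Class-imbalanced learning aims to address this problem, i.e., to learn an unbiased predictive model from imbalanced data.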

Frameworks and Libraries

Python

imbalanced-ensemble [Github][Documentation][Gallery][Paper]

NOTE: written in Python, easy to use.
- imbalanced-ensemble is a Python toolbox for quickly implementing and deploying ensemble learning algorithms on class-imbalanced data. Its features include:
  - (i) Unified, easy-to-use APIs, detailed documentation and examples.
  - (ii) Multi-class imbalanced learning supported out of the box.
  - (iii) Optimized performance with parallelization when possible using joblib.
  - (iv) Powerful, customizable, interactive training logging and visualizer.
  - (v) Full compatibility with other popular packages like scikit-learn and imbalanced-learn.
- Currently (v0.1.4), it includes more than 15 ensemble algorithms based on re-sampling and cost-sensitive learning (e.g., SMOTEBoost/Bagging, RUSBoost/Bagging, AdaCost, EasyEnsemble, BalanceCascade, SelfPacedEnsemble, ...).
imbalanced-learn [Github][Documentation][Paper]

NOTE: written in Python, easy to use.
- imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.
- Currently (v0.8.0), it includes 21 different re-sampling techniques, including over-sampling, under-sampling and hybrid ones (e.g., SMOTE, ADASYN, TomekLinks, NearMiss, OneSidedSelection, SMOTETomek, ...).
- This package also provides many utilities, e.g., Batch generator for Keras/TensorFlow, see API reference.
smote_variants [Documentation][Github] - A collection of 85 minority over-sampling techniques for imbalanced learning with multi-class oversampling and model selection features (all written in Python, with R and Julia also supported).

R

smote_variants [Documentation][Github] - A collection of 85 minority over-sampling techniques for imbalanced learning with multi-class oversampling and model selection features (all written in Python, with R and Julia also supported).
caret [Documentation][Github] - Contains the implementation of Random under/over-sampling.
ROSE [Documentation] - Contains the implementation of ROSE (Random Over-Sampling Examples).
DMwR [Documentation] - Contains the implementation of SMOTE (Synthetic Minority Over-sampling TEchnique).

Java

KEEL [Github][Paper] - KEEL provides a simple GUI based on data flow to design experiments with different datasets and computational intelligence algorithms (paying special attention to evolutionary algorithms) in order to assess the behavior of the algorithms. This tool includes many widely used imbalanced learning techniques such as (evolutionary) over/under-resampling, cost-sensitive learning, algorithm modification, and ensemble learning methods.

NOTE: wide variety of classical classification, regression, preprocessing algorithms included.

Scala

undersampling [Documentation][Github] - A Scala library for under-sampling and their ensemble variants in imbalanced classification.

Julia

smote_variants [Documentation][Github] - A collection of 85 minority over-sampling techniques for imbalanced learning with multi-class oversampling and model selection features (all written in Python, with R and Julia also supported).

Research Papers

Surveys

Learning from imbalanced data (IEEE TKDE, 2009, 6000+ citations) [Paper]
- Highly cited, classic survey paper. It systematically reviewed the popular solutions, evaluation metrics, and challenging problems in future research in this area (as of 2009).
Learning from imbalanced data: open challenges and future directions (2016, 900+ citations) [Paper]
- This paper concentrates on the open issues and challenges in imbalanced learning, i.e., extreme class imbalance, imbalance in online/stream learning, multi-class imbalanced learning, and semi/un-supervised imbalanced learning.
Learning from class-imbalanced data: Review of methods and applications (2017, 900+ citations) [Paper]
- A recent exhaustive survey of imbalanced learning methods and applications, covering a total of 527 papers. It provides several detailed taxonomies of existing methods as well as recent trends in this research area.

Ensemble Learning

General ensemble

Self-paced Ensemble (ICDE 2020, 20+ citations) [Paper][Code][Slides][Zhihu/知乎][PyPI]

NOTE: versatile solution with outstanding performance and computational efficiency.
MESA: Boost Ensemble Imbalanced Learning with MEta-SAmpler (NeurIPS 2020) [Paper][Code][Video][Zhihu/知乎]

NOTE: learning an optimal sampling policy directly from data.
Exploratory Undersampling for Class-Imbalance Learning (IEEE Trans. on SMC, 2008, 1300+ citations) [Paper]

NOTE: simple but effective solution.
- EasyEnsemble [Code]
- BalanceCascade [Code]
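
The exploratory under-sampling idea behind EasyEnsemble can be sketched roughly as follows: draw several random subsets of the majority class, each the same size as the minority class, train one learner per balanced subset, and combine their predictions. This is a minimal illustration under stated assumptions, not the paper's implementation: the nearest-centroid base learner and majority voting are stand-ins chosen to keep the example dependency-free (the paper trains boosted learners on each subset), and all function names are hypothetical.

```python
import random
from statistics import mean


def balanced_subsets(X, y, minority_label, n_subsets, seed=0):
    """Build balanced training sets: every minority sample plus an equally
    sized random draw (without replacement) from the remaining samples."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    subsets = []
    for _ in range(n_subsets):
        chosen = minority + rng.sample(majority, len(minority))
        subsets.append(([X[i] for i in chosen], [y[i] for i in chosen]))
    return subsets


def fit_centroids(X, y):
    """Stand-in base learner (an assumption): one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def ensemble_predict(models, x):
    """Majority vote over the per-subset learners."""
    votes = [predict_centroid(m, x) for m in models]
    return max(set(votes), key=votes.count)


# Toy data: 200 majority samples around (0, 0), 20 minority samples around (3, 3).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(20)]
y = [0] * 200 + [1] * 20

subsets = balanced_subsets(X, y, minority_label=1, n_subsets=5)
models = [fit_centroids(Xs, ys) for Xs, ys in subsets]
print(ensemble_predict(models, [3.0, 3.0]))
```

Each learner sees a balanced view of the data, while the ensemble as a whole still covers a large part of the majority class.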

Boosting-based

AdaBoost (1995, 18700+ citations) [Paper][Code] - Adaptive Boosting with C4.5
DataBoost (2004, 570+ citations) [Paper] - Boosting with Data Generation for Imbalanced Data
SMOTEBoost (2003, 1100+ citations) [Paper][Code] - Synthetic Minority Over-sampling TEchnique Boosting
MSMOTEBoost (2011, 1300+ citations) [Paper] - Modified Synthetic Minority Over-sampling TEchnique Boosting
RAMOBoost (2010, 140+ citations) [Paper] [Code] - Ranked Minority Over-sampling in Boosting
RUSBoost (2009, 850+ citations) [Paper] [Code] - Random Under-Sampling Boosting
AdaBoostNC (2012, 350+ citations) [Paper] - Adaptive Boosting with Negative Correlation Learning
EUSBoost (2013, 210+ citations) [Paper] - Evolutionary Under-sampling in Boosting

Bagging-based

Bagging (1996, 20000+ citations) [Paper][Code] - Bagging predictors
Diversity Analysis on Imbalanced Data Sets by Using Ensemble Models (2009, 400+ citations) [Paper]
- UnderBagging [Code]
- OverBagging [Code]
- SMOTEBagging [Code]

Cost-sensitive ensemble

AdaCost (ICML 1999, 800+ citations) [Paper][Code] - Misclassification Cost-sensitive boosting
AdaUBoost (NIPS 1999, 100+ citations) [Paper][Code] - AdaBoost with Unequal loss functions
AsymBoost (NIPS 2001, 700+ citations) [Paper][Code] - Asymmetric AdaBoost and detector cascade

Data resampling

Over-sampling

ROS [Code] - Random Over-sampling
SMOTE (2002, 9800+ citations) [Paper][Code] - Synthetic Minority Over-sampling TEchnique
Borderline-SMOTE (2005, 1400+ citations) [Paper][Code] - Borderline-Synthetic Minority Over-sampling TEchnique
ADASYN (2008, 1100+ citations) [Paper][Code] - ADAptive SYNthetic Sampling
SPIDER (2008, 150+ citations) [Paper][Code(Java)] - Selective Preprocessing of Imbalanced Data
Safe-Level-SMOTE (2009, 370+ citations) [Paper][Code(Java)] - Safe Level Synthetic Minority Over-sampling TEchnique
SVM-SMOTE (2009, 120+ citations) [Paper][Code] - SMOTE based on Support Vectors of SVM
MDO (2015, 150+ citations) [Paper][Code] - Mahalanobis Distance-based Over-sampling for Multi-Class imbalanced problems.

NOTE: See more over-sampling methods at smote-variants.
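
The interpolation step at the core of SMOTE can be sketched as: pick a minority sample, pick one of its k nearest minority neighbours, and create a synthetic point somewhere on the line segment between them. The snippet below is a simplified, dependency-free sketch, not the reference implementation: it picks seed samples at random rather than generating a fixed number per minority sample, and the function name `smote` and its parameters are illustrative.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Needs at least two minority samples."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x, excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(x, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
new_points = smote(minority, n_new=10, k=3)
print(len(new_points))
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the bounding box of the minority class.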

Under-sampling

RUS [Code] - Random Under-sampling
CNN (1968, 2100+ citations) [Paper][Code] - Condensed Nearest Neighbor
ENN (1972, 1500+ citations) [Paper] [Code] - Edited Nearest Neighbor
TomekLink (1976, 870+ citations) [Paper][Code] - Tomek's modification of Condensed Nearest Neighbor
NCR (2001, 500+ citations) [Paper][Code] - Neighborhood Cleaning Rule
NearMiss-1 & 2 & 3 (2003, 420+ citations) [Paper][Code] - Several kNN approaches to unbalanced data distributions.
CNN with TomekLink (2004, 2000+ citations) [Paper][Code(Java)] - Condensed Nearest Neighbor + TomekLink
OSS (1997, 2100+ citations) [Paper][Code] - One-Sided Selection
EUS (2009, 290+ citations) [Paper] - Evolutionary Under-sampling
IHT (2014, 130+ citations) [Paper][Code] - Instance Hardness Threshold
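
Two of the simplest ideas in this list can be sketched in a few lines: random under-sampling (RUS) shrinks every class to the size of the smallest one, and Tomek-link cleaning removes majority samples that form a mutual nearest-neighbour pair with a sample of another class. This is a brute-force sketch for illustration; the function names are hypothetical, and library implementations such as those in imbalanced-learn are more general and more efficient.

```python
import random
from collections import Counter


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def random_under_sample(X, y, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    target = min(Counter(y).values())
    keep = []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours and
    carry different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: sq_dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(X, y) for k in pair
            if y[k] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# 1-D toy data: the sample at 2.0 (class 0) sits right next to 2.4 (class 1).
X = [[0.0], [1.1], [2.0], [2.4], [5.0]]
y = [0, 0, 0, 1, 1]

links = tomek_links(X, y)
X_clean, y_clean = remove_majority_in_links(X, y, majority_label=0)
X_rus, y_rus = random_under_sample(X, y)
print(links, y_clean, y_rus)
```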

Hybrid-sampling

A Study of the Behavior of Several Methods for Balancing Training Data (2004, 2000+ citations) [Paper]

NOTE: extensive experimental evaluation involving 10 different over/under-sampling methods.
- SMOTE-Tomek [Code]
- SMOTE-ENN [Code]
SMOTE-RSB (2012, 210+ citations) [Paper][Code] - Hybrid Preprocessing using SMOTE and Rough Sets Theory
SMOTE-IPF (2015, 180+ citations) [Paper][Code] - SMOTE with Iterative-Partitioning Filter

Cost-sensitive Learning

CSC4.5 (2002, 420+ citations) [Paper][Code(Java)] - An instance-weighting method to induce cost-sensitive trees
CSSVM (2008, 710+ citations) [Paper][Code(Java)] - Cost-sensitive SVMs for highly imbalanced classification
CSNN (2005, 950+ citations) [Paper][Code(Java)] - Training cost-sensitive neural networks with methods addressing the class imbalance problem.
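
A generic way to see how misclassification costs enter a classifier is through the decision threshold. Assuming correct predictions cost nothing, a false positive costs `cost_fp` and a false negative costs `cost_fn`, predicting the positive class minimizes the expected cost whenever the estimated positive probability is at least `cost_fp / (cost_fp + cost_fn)`. The sketch below illustrates only this general principle; it is not the specific method of any paper listed above, and the function names are made up for the example.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold above which predicting 'positive' has lower
    expected cost than predicting 'negative'."""
    return cost_fp / (cost_fp + cost_fn)


def expected_costs(p_pos, cost_fp, cost_fn):
    """Expected cost of each decision given P(positive) = p_pos."""
    cost_if_predict_pos = (1.0 - p_pos) * cost_fp  # wrong only if truly negative
    cost_if_predict_neg = p_pos * cost_fn          # wrong only if truly positive
    return cost_if_predict_pos, cost_if_predict_neg


def predict(p_pos, cost_fp, cost_fn):
    """Return 1 (positive) or 0 (negative), minimizing expected cost."""
    return int(p_pos >= cost_sensitive_threshold(cost_fp, cost_fn))


# Missing a rare positive is assumed to be 9x as costly as a false alarm.
print(cost_sensitive_threshold(cost_fp=1, cost_fn=9))
print(predict(0.2, cost_fp=1, cost_fn=9))
print(predict(0.2, cost_fp=1, cost_fn=1))
```

With equal costs the threshold is the familiar 0.5; raising the cost of a false negative lowers the threshold, so the minority (positive) class is predicted more readily.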

Deep Learning

Surveys

A systematic study of the class imbalance problem in convolutional neural networks (2018, 330+ citations) [Paper]
Survey on deep learning with class imbalance (2019, 50+ citations) [Paper]

NOTE: a recent comprehensive survey of the class imbalance problem in deep learning.

Graph Neural Networks

GraphSMOTE: Imbalanced Node Classification on Graphs with Graph Neural Networks (WSDM 2021) [Paper][Code]
Topology-Imbalance Learning for Semi-Supervised Node Classification (NeurIPS 2021) [Paper][Code]
GraphENS: Neighbor-Aware Ego Network Synthesis for Class-Imbalanced Node Classification (ICLR 2022) [Paper][Code]
LTE4G: Long-Tail Experts for Graph Neural Networks (CIKM 2022) [Paper][Code]

Hard example mining

Training region-based object detectors with online hard example mining (CVPR 2016, 840+ citations) [Paper][Code] - In the later phase of NN training, only do gradient back-propagation for "hard examples" (i.e., with large loss value)
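
The selection step described above, back-propagating only through the examples with the largest loss, can be sketched as a top-k filter over per-example losses. This is a framework-free illustration with a hypothetical function name; in a real training loop the per-example losses would come from the network's forward pass and only the selected ones would contribute gradients.

```python
def ohem_select(losses, keep):
    """Return the indices of the `keep` largest per-example losses and the
    mean loss over just those hard examples."""
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    hard = sorted(ranked[:keep])
    return hard, sum(losses[i] for i in hard) / len(hard)


per_example_losses = [0.5, 2.0, 0.25, 1.0, 0.125]
hard_indices, batch_loss = ohem_select(per_example_losses, keep=2)
print(hard_indices, batch_loss)
```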

Loss function engineering

Focal loss for dense object detection (ICCV 2017, 2600+ citations) [Paper][Code (detectron2)][Code (unofficial)] - A uniform loss function that focuses training on a sparse set of hard examples to prevent the vast number of easy negatives from overwhelming the detector during training.

NOTE: elegant solution, high influence.
Training deep neural networks on imbalanced data sets (IJCNN 2016, 110+ citations) [Paper] - Mean (square) false error that can equally capture classification errors from both the majority class and the minority class.
Deep imbalanced attribute classification using visual attention aggregation (ECCV 2018, 30+ citations) [Paper][Code]
Imbalanced deep learning by minority class incremental rectification (TPAMI 2018, 60+ citations) [Paper] - Class Rectification Loss for minimizing the dominant effect of majority classes by discovering sparsely sampled boundaries of minority classes in an iterative batch-wise learning process.
Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss (NeurIPS 2019, 10+ citations) [Paper][Code] - A theoretically-principled label-distribution-aware margin (LDAM) loss motivated by minimizing a margin-based generalization bound.
Gradient harmonized single-stage detector (AAAI 2019, 40+ citations) [Paper][Code] - Compared to Focal Loss, which only down-weights "easy" negative examples, GHM also down-weights "very hard" examples as they are likely to be outliers.
Class-Balanced Loss Based on Effective Number of Samples (CVPR 2019, 70+ citations) [Paper][Code] - A simple and generic class-reweighting mechanism based on Effective Number of Samples.
Influence-Balanced Loss for Imbalanced Visual Classification (ICCV 2021) [Paper][Code]
AutoBalance: Optimized Loss Functions for Imbalanced Data (NeurIPS 2021) [Paper]
Label-Imbalanced and Group-Sensitive Classification under Overparameterization (NeurIPS 2021) [Paper][Code]
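
Two of the losses in this list have compact closed forms that are easy to write down. Focal loss scales the cross-entropy of an example by (1 - p_t)^gamma, where p_t is the probability assigned to the true class, so well-classified examples contribute little. The class-balanced loss weights class c by (1 - beta) / (1 - beta^n_c), the inverse of its "effective number" of samples. The following is a plain-Python sketch of the binary focal loss and the class-balanced weights; rescaling the weights so they sum to the number of classes is a common convention assumed here, and the function names are illustrative.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.
    p is the predicted probability of the positive class, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def class_balanced_weights(counts, beta=0.999):
    """Per-class weights proportional to (1 - beta) / (1 - beta ** n_c),
    rescaled so that they sum to the number of classes."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


easy = focal_loss(0.9, 1)   # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: keeps most of its loss
weights = class_balanced_weights([1000, 10])
print(easy, hard, weights)
```

Setting `gamma=0` and `alpha=1` for a positive example recovers the ordinary cross-entropy term, which is a quick sanity check on the formula.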

Meta-learning

Learning to model the tail (NIPS 2017, 70+ citations) [Paper] - Transfer meta-knowledge from the data-rich classes in the head of the distribution to the data-poor classes in the tail.
Learning to reweight examples for robust deep learning (ICML 2018, 150+ citations) [Paper][Code] - Implicitly learn a weight function to reweight the samples in gradient updates of DNN.

NOTE: representative work to solve the class imbalance problem through meta-learning.
Meta-weight-net: Learning an explicit mapping for sample weighting (NeurIPS 2019) [Paper][Code] - Explicitly learn a weight function (with an MLP as the function approximator) to reweight the samples in gradient updates of DNN.
Learning Data Manipulation for Augmentation and Weighting (NeurIPS 2019) [Paper][Code]
Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks (ICLR 2020) [Paper][Code]
MESA: Boost Ensemble Imbalanced Learning with MEta-SAmpler (NeurIPS 2020) [Paper][Code][Video]

NOTE: meta-learning-powered ensemble learning

Representation Learning

Learning deep representation for imbalanced classification (CVPR 2016, 220+ citations) [Paper]
Supervised Class Distribution Learning for GANs-Based Imbalanced Classification (ICDM 2019) [Paper]
Decoupling Representation and Classifier for Long-tailed Recognition (ICLR 2020) [Paper][Code]

NOTE: interesting findings on representation learning and classifier learning
Supercharging Imbalanced Data Learning With Energy-based Contrastive Representation Transfer (NeurIPS 2021) [Paper]

Posterior Recalibration

Posterior Re-calibration for Imbalanced Datasets (NeurIPS 2020) [Paper][Code]
Long-tail learning via logit adjustment (ICLR 2021) [Paper][Code]
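
The post-hoc variant of logit adjustment is simple enough to sketch directly: subtract tau * log(prior) from each class logit before taking the argmax, so that frequent classes lose their head start. The snippet is a minimal illustration; the helper names are hypothetical, and the priors are assumed to be the class frequencies of the training set.

```python
import math


def adjust_logits(logits, class_priors, tau=1.0):
    """Subtract tau * log(prior) from each class logit."""
    return [z - tau * math.log(pi) for z, pi in zip(logits, class_priors)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


logits = [2.0, 1.0]    # the raw scores favour the frequent class 0
priors = [0.95, 0.05]  # assumed training-set class frequencies

adjusted = adjust_logits(logits, priors)
print(argmax(logits), argmax(adjusted))
```

With `tau=0` the logits are unchanged; larger values of `tau` shift predictions more strongly toward rare classes.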

Semi/Self-supervised Learning

Rethinking the Value of Labels for Improving Class-Imbalanced Learning (NeurIPS 2020) [Paper][Code][Video]

NOTE: semi-supervised training / self-supervised pre-training helps imbalance learning
Distribution Aligning Refinery of Pseudo-label for Imbalanced Semi-supervised Learning (NeurIPS 2020) [Paper][Code]
ABC: Auxiliary Balanced Classifier for Class-imbalanced Semi-supervised Learning (NeurIPS 2021) [Paper][Code]
Improving Contrastive Learning on Imbalanced Data via Open-World Sampling (NeurIPS 2021) [Paper]
DASO: Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning (CVPR 2022) [Paper][Code]

Curriculum Learning

Dynamic Curriculum Learning for Imbalanced Data Classification (ICCV 2019) [Paper]

Two-phase Training

Brain tumor segmentation with deep neural networks (2017, 1200+ citations) [Paper][Code (unofficial)]

Pre-train on a balanced dataset, then fine-tune the last output layer (before softmax) on the original, imbalanced data.

Network Architecture

BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition (CVPR 2020) [Paper][Code]
Class-Imbalanced Deep Learning via a Class-Balanced Ensemble (TNNLS 2021) [Paper]

Deep Generative Model

Deep Generative Model for Robust Imbalance Classification (CVPR 2020) [Paper]

Imbalanced Regression

Delving into Deep Imbalanced Regression (ICML 2021) [Paper][Code][Video]
Density-based weighting for imbalanced regression (Machine Learning [J], 2021) [Paper][Code]

Anomaly Detection

Surveys
- Anomaly detection: A survey (ACM computing surveys, 2009, 9000+ citations) [Paper]
- A survey of network anomaly detection techniques (2017, 700+ citations) [Paper]
Classification-based
- One-class SVMs for document classification (JMLR, 2001, 1300+ citations) [Paper]
- One-class Collaborative Filtering (ICDM 2008, 1000+ citations) [Paper]
- Isolation Forest (ICDM 2008, 1000+ citations) [Paper]
- Anomaly Detection using One-Class Neural Networks (2018, 200+ citations) [Paper]
- Anomaly Detection with Robust Deep Autoencoders (KDD 2017, 170+ citations) [Paper]

Miscellaneous

Datasets

imbalanced-learn datasets

This collection of datasets is from imblearn.datasets.fetch_datasets.

ID	Name	Repository & Target	Ratio	#Samples	#Features
1	ecoli	UCI, target: imU	8.6:1	336	7
2	optical_digits	UCI, target: 8	9.1:1	5,620	64
3	satimage	UCI, target: 4	9.3:1	6,435	36
4	pen_digits	UCI, target: 5	9.4:1	10,992	16
5	abalone	UCI, target: 7	9.7:1	4,177	10
6	sick_euthyroid	UCI, target: sick euthyroid	9.8:1	3,163	42
7	spectrometer	UCI, target: >=44	11:1	531	93
8	car_eval_34	UCI, target: good, v good	12:1	1,728	21
9	isolet	UCI, target: A, B	12:1	7,797	617
10	us_crime	UCI, target: >0.65	12:1	1,994	100
11	yeast_ml8	LIBSVM, target: 8	13:1	2,417	103
12	scene	LIBSVM, target: >one label	13:1	2,407	294
13	libras_move	UCI, target: 1	14:1	360	90
14	thyroid_sick	UCI, target: sick	15:1	3,772	52
15	coil_2000	KDD, CoIL, target: minority	16:1	9,822	85
16	arrhythmia	UCI, target: 06	17:1	452	278
17	solar_flare_m0	UCI, target: M->0	19:1	1,389	32
18	oil	UCI, target: minority	22:1	937	49
19	car_eval_4	UCI, target: vgood	26:1	1,728	21
20	wine_quality	UCI, wine, target: <=4	26:1	4,898	11
21	letter_img	UCI, target: Z	26:1	20,000	16
22	yeast_me2	UCI, target: ME2	28:1	1,484	8
23	webpage	LIBSVM, w7a, target: minority	33:1	34,780	300
24	ozone_level	UCI, ozone, data	34:1	2,536	72
25	mammography	UCI, target: minority	42:1	11,183	6
26	protein_homo	KDD CUP 2004, minority	111:1	145,751	74
27	abalone_19	UCI, target: 19	130:1	4,177	10
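
The Ratio column above is read here as the number of majority samples per minority sample; that reading is an assumption, not something the table states. Under it, the ratio for any labelled dataset can be computed as below. The helper name is hypothetical and the labels are synthetic, not taken from any dataset in the table.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Synthetic labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
print(f"{imbalance_ratio(labels):.1f}:1")
```

For multi-class problems this reports the ratio between the most and the least frequent class.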

Imbalanced Databases

Link: GitHub - gykovacs/common_datasets: machine learning databases

GitHub Repositories

Algorithms & Utilities & Jupyter Notebooks

imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data.
imbalanced-dataset-sampler - A (PyTorch) imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones.
class_imbalance - Jupyter Notebook presentation for class imbalance in binary classification.
Multi-class-with-imbalanced-dataset-classification - Perform multi-class classification on imbalanced 20-news-group dataset.
Advanced Machine Learning with scikit-learn: Imbalanced classification and text data - Different approaches to feature selection, and resampling methods for imbalanced data.
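
The idea behind samplers such as imbalanced-dataset-sampler, drawing rare classes more often and frequent classes less often, can be illustrated with inverse-frequency sampling weights. The snippet below is a standard-library sketch of that general idea and does not reproduce the API or internals of any repository listed above; the function names are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample a weight of 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_balanced_indices(labels, n_draws, seed=0):
    """Draw sample indices with replacement so that every class is
    (in expectation) equally represented."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=n_draws)


labels = ["neg"] * 950 + ["pos"] * 50
drawn = sample_balanced_indices(labels, n_draws=2000)
pos_fraction = sum(labels[i] == "pos" for i in drawn) / len(drawn)
print(round(pos_fraction, 2))
```

Because the weights within each class sum to one, both classes are drawn about equally often even though positives make up only 5% of the data.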