Supervised learning with imbalanced data sets: an overview

Traditional learning algorithms trained on complex, highly imbalanced data sets may have difficulty distinguishing between examples of the different classes. They tend to produce classification models biased toward the overrepresented (majority) class, which results in a low recognition rate for the minority class. This paper surveys this problem, which has attracted the interest of many researchers in recent years. Within the scope of two-class classification tasks, it presents concepts related to the nature of the class imbalance problem and to evaluation metrics, including the foundations of ROC (Receiver Operating Characteristic) analysis, together with the state of the art in proposed solutions. The paper closes with a brief discussion of how the subject can be extended to multiclass learning.
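
As a small illustration of why evaluation metrics matter in this setting, the sketch below computes accuracy alongside the true positive rate and false positive rate (the two axes of ROC space) for a trivial model that always predicts the majority class. The 95/5 class split, the label encoding (1 for the minority class) and the function names are assumptions made for this example and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn


def rates(y_true, y_pred, positive=1):
    """Return (accuracy, true positive rate, false positive rate)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the minority class
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, tpr, fpr


# Hypothetical data set: 95 majority examples (label 0), 5 minority examples (label 1).
y_true = [0] * 95 + [1] * 5

# A degenerate model that always predicts the majority class.
y_majority = [0] * 100

accuracy, tpr, fpr = rates(y_true, y_majority)
print(f"accuracy={accuracy:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Under these assumptions the model reaches high accuracy while recognizing none of the minority examples, which is the kind of bias that accuracy alone hides and that the rates used in ROC analysis make visible.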

imbalanced data sets; supervised learning; evaluation metrics; ROC analysis; resampling methods; cost-sensitive approach


Sociedade Brasileira de Automática, Secretaria da SBA, FEEC - Unicamp, BLOCO B - LE51, Av. Albert Einstein, 400, Cidade Universitária Zeferino Vaz, Distrito de Barão Geraldo, 13083-852 - Campinas - SP - Brazil, Tel.: (55 19) 3521 3824, Fax: (55 19) 3521 3866
E-mail: revista_sba@fee.unicamp.br