A Framework for Characterizing What Makes an Instance Hard to Classify


The health domain has benefited greatly from Machine Learning solutions, which can be used to build predictive models that support medical decisions. However, to increase the reliability of these systems, it is important to understand when the models are prone to failure. In this paper, we investigate what can be learned from the instances of a dataset that are hard to classify by Machine Learning models. Different reasons may explain why one instance or a set of instances is misclassified regardless of the predictive model used: they can be noisy, anomalous, or placed in overlapping regions, to name a few. Our framework works at two levels: the original base dataset and a meta-dataset built to reflect the hardness level of the instances. A two-dimensional hardness embedding is assembled, which can be visually inspected to identify sets of instances that deserve closer scrutiny. We present analyses that can be undertaken in this hardness space to characterize why some instances are hard to classify, with case studies on health datasets.
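The idea of scoring how hard each instance is across multiple models, then projecting those scores into a two-dimensional space for visual inspection, can be sketched as follows. This is an illustrative approximation, not the paper's implementation: it estimates hardness as the fraction of cross-validated classifiers that misclassify each instance, and uses PCA over the per-model error profiles as a stand-in for the hardness embedding. The choice of classifier pool, dataset, and projection method here are all assumptions for demonstration.

```python
# Illustrative sketch (not the authors' implementation): estimate instance
# hardness as the fraction of cross-validated models that misclassify each
# instance, then project the per-model error profiles into two dimensions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A health-related benchmark dataset, chosen only for demonstration.
X, y = load_breast_cancer(return_X_y=True)

# A pool of diverse classifiers; an instance misclassified by many of them,
# under cross-validation, is considered "hard".
models = [
    LogisticRegression(max_iter=5000),
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(),
]

# errors[i, j] = 1.0 if model j misclassifies instance i under 5-fold CV.
errors = np.column_stack(
    [cross_val_predict(m, X, y, cv=5) != y for m in models]
).astype(float)

# Hardness score per instance: fraction of models that fail on it.
hardness = errors.mean(axis=1)

# Simple stand-in for a 2D hardness embedding: project the error profiles.
embedding = PCA(n_components=2, random_state=0).fit_transform(errors)

print(f"instances misclassified by a majority of models: {(hardness > 0.5).sum()}")
```

Instances with high hardness scores would then be the candidates for closer inspection in the embedding, e.g., to check whether they cluster together (suggesting an overlapping region) or appear isolated (suggesting noise or anomalies).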
VALERIANO, Maria Gabriela; PAIVA, Pedro Yuri Arbs; KIFFER, Carlos Roberto Veiga; LORENA, Ana Carolina. A Framework for Characterizing What Makes an Instance Hard to Classify. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 12., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 353-367. ISSN 2643-6264.