Hardness Sampling: Exploring Instance Hardness in Pool-Based Active Learning

Gabriel da S. C. Nogueira; Davi P. dos Santos; André C. P. L. F. de Carvalho; Luís P. F. Garcia

Gabriel da S. C. Nogueira UnB / USP
Davi P. dos Santos UTFPR
André C. P. L. F. de Carvalho USP
Luís P. F. Garcia UnB

Abstract

Predictive tasks usually require labeled data, which are often costly or impractical to acquire. Active Learning (AL) methods address this bottleneck, enabling the induction of predictive models at a reduced annotation cost. They optimize the labeling process by strategically sampling a subset of the unlabeled examples, sequentially selecting instances until a suitable labeled set is obtained. Several sampling strategies have been proposed in the literature. Although strategies designed for the Pool-based AL (PAL) scenario have produced some of the best results, they still have known limitations, and this paper investigates how to address them. PAL iteratively selects instances from a data pool to be labeled by an oracle and incorporated into the training set, increasing its representativeness for the current classification task. Its strategies, however, are limited by the active learners’ bias within the exploration-exploitation tradeoff. For instance, non-agnostic strategies are biased toward prospective sampling, as they depend on the learner’s assumptions, which has often resulted in suboptimal performance in the early stages of AL. To mitigate this issue, more exploratory biases have been introduced. Nevertheless, current strategies continue to exhibit shortcomings in data selection. Meanwhile, recent studies have shown that identifying instances that are hard to classify significantly impacts the predictive performance of classifiers. This paper proposes a new non-agnostic PAL strategy, Hardness Sampling (HardS), which is based on Hardness Measures (HMs). HMs employ the Instance Hardness (IH) concept to identify instances with a higher probability of being misclassified. According to experiments carried out across diverse datasets, HardS is a competitive alternative to classical approaches, in particular Uncertainty Sampling, Expected Error Reduction, and Density-Weighted methods.
The experimental results also suggest that some groups of HMs introduce a self-regulating balance between exploratory and prospective sampling biases, thereby addressing a key challenge in PAL.
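
The pool-based loop described in the abstract (score the unlabeled pool, query the oracle for one instance, add it to the training set, repeat) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid learner, the synthetic two-blob data, and all function names are hypothetical, and the scoring function shown is least-confidence Uncertainty Sampling, one of the classical baselines named above. The abstract does not specify how HardS scores instances, so a hardness measure would have to be supplied through the `score` parameter.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def fit_centroids(points, labels):
    """Toy learner (an assumption of this sketch): one centroid per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_proba(centroids, x):
    """Softmax over negative distances to each class centroid."""
    weights = {y: math.exp(-euclidean(x, c)) for y, c in centroids.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def least_confidence(centroids, x):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(predict_proba(centroids, x).values())


def pool_based_al(pool, oracle, seed_indices, budget, score=least_confidence):
    """Query up to `budget` labels, one per iteration; returns {index: label}."""
    labeled = {i: oracle(i) for i in seed_indices}
    unlabeled = set(range(len(pool))) - set(labeled)
    for _ in range(budget):
        if not unlabeled:
            break
        idx = sorted(labeled)
        # Retrain the learner on everything labeled so far.
        centroids = fit_centroids([pool[i] for i in idx], [labeled[i] for i in idx])
        # Pick the unlabeled instance with the highest score (ties: lowest index).
        query = max(sorted(unlabeled), key=lambda i: score(centroids, pool[i]))
        labeled[query] = oracle(query)  # the oracle supplies the label
        unlabeled.remove(query)
    return labeled


# Synthetic pool: two Gaussian blobs in 2D; the hidden labels act as the oracle.
rng = random.Random(0)
pool = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
pool += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(50)]
true_labels = [0] * 50 + [1] * 50

labeled = pool_based_al(pool, lambda i: true_labels[i], seed_indices=[0, 50], budget=10)
print(len(labeled), "labels acquired")
```

Keeping the scoring function as a parameter separates the query loop from the selection criterion, so a different strategy can be tried by passing another callable with the same `(model, instance)` signature.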