Characterizing instance hardness in classification and regression problems

  • Gustavo P. Torquette, Universidade Federal de São Paulo
  • Victor S. Nunes, Instituto Tecnológico de Aeronáutica
  • Pedro Y. A. Paiva, Instituto Tecnológico de Aeronáutica
  • Lourenço B. Cunha Neto, Instituto Tecnológico de Aeronáutica
  • Ana C. Lorena, Instituto Tecnológico de Aeronáutica

Abstract


Recent work in the Machine Learning (ML) literature has demonstrated the usefulness of identifying the observations in a dataset whose labels are hardest to predict accurately. Once identified, such instances can be inspected for data quality issues that should be addressed, and learning strategies that account for the difficulty of each observation can be devised. This paper presents a set of meta-features, known as instance hardness measures, that characterize which instances of a dataset are hardest to predict and why. Both classification and regression problems are considered. Synthetic datasets with different levels of complexity are built and analyzed, and a Python package containing all implementations is provided.
Keywords: Data complexity, Instance Hardness, Hardness Measures, Machine Learning
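
To make the idea of an instance hardness measure concrete, the sketch below computes k-Disagreeing Neighbors (kDN), a measure known from the instance hardness literature: for each instance, the fraction of its k nearest neighbors that carry a different class label. Whether the paper's measure set includes kDN in this exact form is an assumption. The function name `k_disagreeing_neighbors`, the toy data, and the brute-force NumPy implementation are illustrative only and are not the API of the Python package mentioned in the abstract.

```python
import numpy as np


def k_disagreeing_neighbors(X, y, k=5):
    """Return, per instance, the fraction of its k nearest neighbors
    (Euclidean distance) whose label differs from its own.

    Hypothetical helper for illustration; not taken from the paper's package.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of instances")

    # Brute-force pairwise squared distances; adequate for small datasets.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbor

    # Indices of the k closest instances for every row.
    neighbors = np.argsort(dist, axis=1, kind="stable")[:, :k]

    # Share of neighbors whose label disagrees with the instance's label.
    return (y[neighbors] != y[:, None]).mean(axis=1)


# Toy data (made up): two well-separated clusters, plus one point that lies
# inside the class-0 cluster but is labeled 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6],
              [0.5, 0.5]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

hardness = k_disagreeing_neighbors(X, y, k=3)
print(np.round(hardness, 2))
print("hardest instance:", int(np.argmax(hardness)))
```

Instances scoring close to 1 are surrounded by other classes and are natural candidates for a data quality inspection. This sketch covers classification only; regression would need a different notion of neighbor disagreement, which is not attempted here.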

Published
28 November 2022
TORQUETTE, Gustavo P.; NUNES, Victor S.; PAIVA, Pedro Y. A.; NETO, Lourenço B. Cunha; LORENA, Ana C. Characterizing instance hardness in classification and regression problems. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 10., 2022, Campinas/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 178-185. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2022.227758.