An Analysis of Public Datasets for Hierarchical Classification

Gustavo Vieira Maia; Frederico Gualberto Ferreira Coelho

doi:10.5753/latinoware.2025.16302

Gustavo Vieira Maia UFMG
Frederico Gualberto Ferreira Coelho UFMG

Abstract

Hierarchical classification is a machine learning task that leverages inherent parent-child relationships between class labels and offers advantages in predictive performance and interpretability over traditional "flat" classification. Despite its potential, its adoption in domains other than text, image and biology is limited, partly due to a perceived scarcity of suitable public datasets. This study investigates the availability of hierarchical datasets within the UCI Machine Learning Repository and OpenML. We employed a novel methodology that uses Large Language Models to automatically classify the metadata of over 1200 candidate datasets, followed by manual verification of promising candidates. Our findings reveal a shortage of public tabular datasets suitable for hierarchical classification: out of the entire collection, only three potential datasets were identified. This work quantifies the data scarcity problem, highlighting it as a significant bottleneck that hinders research, development, and the broader application of hierarchical modeling techniques. To the best of our knowledge, this is the first large-scale quantitative study of hierarchical classification dataset availability in major public repositories.

Keywords: Machine Learning, Hierarchical Classification, Open Datasets, Data Science
