A Thorough Exploitation of Distance-Based Meta-Features for Automated Text Classification

Sergio Canuto; Marcos André Gonçalves; Thierson Couto Rosa

doi:10.5753/sbbd_estendido.2021.18184

Sergio Canuto, Universidade Federal de Minas Gerais (UFMG)
Marcos André Gonçalves, Universidade Federal de Minas Gerais (UFMG)
Thierson Couto Rosa, Universidade Federal de Goiás (UFG)


Abstract

Defining a set of informative features that can represent and discriminate documents is paramount for automatic document classification. In this doctoral dissertation, we present the most comprehensive study so far on the role of meta-features (high-level features built from lower-level ones) as an alternative for representing documents. We start by proposing new sets of (meta-)features that exploit distance measures in the original (bag-of-words) feature space to summarize potentially complex relationships between documents. We then (i) analyze the discriminative power of these meta-features with novel multi-objective feature selection strategies; (ii) provide new GPU implementations to reduce computational time; (iii) enrich distance relationships with labeled or context-specific information; and (iv) adapt the proposed meta-features to tasks as hard as sentiment analysis. Our experimental results show that, by exploiting distances, our meta-features achieve remarkable classification results and reach the state of the art in many scenarios.

Keywords: meta-features, text classification, distance-based
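
To make the idea of distance-based meta-features more concrete, the following is a minimal sketch, not the dissertation's actual feature sets or GPU implementation. It assumes cosine similarity as the distance measure, dense term-count vectors as the bag-of-words representation, and two kinds of per-class features: the k largest similarities to training documents of the class and the similarity to the class centroid. The function name `meta_features`, the parameter `k`, and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def meta_features(doc, train_docs, train_labels, k=2):
    """Hypothetical helper: map one bag-of-words vector to meta-features.

    For each class (in sorted label order) it emits the k largest cosine
    similarities to training documents of that class (padded with 0.0 when
    the class has fewer than k documents), followed by the similarity to
    the class centroid.
    """
    by_class = defaultdict(list)
    for vec, label in zip(train_docs, train_labels):
        by_class[label].append(vec)

    features = []
    for label in sorted(by_class):
        members = by_class[label]
        sims = sorted((cosine(doc, m) for m in members), reverse=True)[:k]
        sims += [0.0] * (k - len(sims))
        features.extend(sims)
        features.append(cosine(doc, centroid(members)))
    return features


# Toy term-count vectors over a 4-term vocabulary (illustrative data only).
train_docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
train_labels = ["sports", "sports", "politics", "politics"]

query = [1, 1, 0, 0]
vector = meta_features(query, train_docs, train_labels, k=2)
# Layout: [politics nn1, politics nn2, politics centroid,
#          sports nn1, sports nn2, sports centroid]
```

Under these assumptions, a document with thousands of sparse term dimensions is summarized by only `(k + 1) * number_of_classes` dense values, which a downstream classifier can then consume in place of, or alongside, the original bag-of-words vector.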
