Re-Think, Retrieve and Re-Ranking: A New Complementary Two-stage Retrieving and Re-ranking Pipeline for Extreme Multi-Label Text Classification
Abstract
Extreme multi-label text classification (XMTC) involves assigning relevant labels to a text from a very large label space. To address the core challenges of XMTC (volume, imbalance, and quality), we propose xCoRetriev, a two-stage retrieving and re-ranking pipeline that moves from a classification perspective to an information retrieval (IR) approach. We address the volume challenge by efficiently combining IR methods; we address the imbalance challenge by better capturing the text-label relationship; and we address the quality challenge by enriching tag names with pseudo-labels. Our results demonstrate the strengths of xCoRetriev compared to baselines in terms of: (i) scalability to large label spaces and large amounts of text; (ii) effectiveness under high imbalance, especially for predicting infrequent labels, with gains of up to 40% in MRR and NDCG; and (iii) the ability to handle poor-quality annotated text and labels.
