Weight and Expand: Impact and Limitations of Contextual-Sparse Representations in Topic Modeling

Abstract
This work proposes using contextual-sparse representations in Topic Modeling (TM), aiming to combine the interpretability of sparse representations with the semantic power of contextual ones. Using the SPLADE model, we represent documents in a context-sensitive manner through term expansion and term weighting. We empirically evaluate this approach against alternative document representations. The results indicate that term weighting enables effective TM, while term expansion, although promising, presents limitations due to the mismatch between the representation's vocabulary and the original texts.
Keywords: Topic Modeling, Contextual-Sparse Representations, SPLADE, Textual Expansion
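The weight-and-expand idea summarized above can be sketched numerically. The aggregation below follows the published SPLADE formulation (log-saturated ReLU of masked-language-model logits, max-pooled over token positions); the toy logits, vocabulary size, and function name are illustrative assumptions, not the authors' code.

```python
import numpy as np

def splade_weights(logits):
    """SPLADE-style term weighting: for each vocabulary term j, take
    max over token positions i of log(1 + ReLU(logit[i, j])).
    Terms not present in the document can still receive positive
    weight (term expansion), while most terms remain exactly zero,
    yielding a sparse, interpretable document vector."""
    return np.max(np.log1p(np.maximum(logits, 0.0)), axis=0)

# Toy MLM logits for a 3-token document over a 6-term vocabulary.
# Positive entries mark terms the model deems relevant at that position.
logits = np.array([
    [2.0, 0.0, -1.0, 0.5, -3.0, -2.0],
    [0.0, 1.5, -0.5, 0.0, -1.0, -4.0],
    [0.3, 0.0, -2.0, 2.5, -0.5, -1.0],
])

w = splade_weights(logits)
print(w)                    # per-term weights, mostly zero
print(np.count_nonzero(w))  # only 3 of 6 vocabulary terms are active
```

A topic model can then be fitted (e.g., via NMF) on the matrix of such sparse document vectors; the vocabulary mismatch noted in the abstract arises because expansion terms with positive weight need not occur in the original text.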

Published: 2025-09-29
MACHADO, Ana Cláudia; FRANÇA, Celso; NUNES, Ian; GONÇALVES, Marcos André; ROCHA, Leonardo. Weight and Expand: Impact and Limitations of Contextual-Sparse Representations in Topic Modeling. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 928-934. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247809.