Adapting Large Language Models for Topic Modeling Tasks

  • Daniel Carvalho UFSJ
  • Antônio Pereira UFSJ
  • Elisa Tuler UFSJ
  • Diego Dias UFES
  • Washington Cunha UFMG
  • Leonardo Rocha UFSJ

Abstract


This work presents a proposal for adapting Large Language Models (LLMs) to the unsupervised task of Topic Modeling (TM). Our proposal consists of three stages: document summarization, topic characterization, and topic definition. We instantiated it with two LLMs, one open-source (Llama3) and one proprietary (GPT 3.5), comparing them against four state-of-the-art (SOTA) TM strategies. Our results show that the approach is promising: it defines topics as coherent as those of the SOTA strategies, though with room for improvement in organizational structure.
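The three-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `llm` stub, the prompt wording, and the helper names are all assumptions standing in for an actual model call (e.g. to Llama3 or GPT 3.5).

```python
# Hedged sketch of a three-stage LLM topic-modeling pipeline:
# (1) summarize each document, (2) characterize candidate topics,
# (3) define final topic labels. The `llm` stub is a placeholder
# for a real model API and simply echoes part of its prompt.

def llm(prompt: str) -> str:
    """Placeholder for an actual LLM call; replace with a real API client."""
    return prompt.splitlines()[-1]  # dummy behavior, for illustration only

def summarize(documents):
    # Stage 1: compress each document into a short summary.
    return [llm(f"Summarize in one sentence:\n{doc}") for doc in documents]

def characterize_topics(summaries):
    # Stage 2: ask the model to extract the main themes from the summaries.
    joined = "\n".join(summaries)
    return llm(f"List the main themes of these summaries:\n{joined}")

def define_topics(characterization):
    # Stage 3: turn the theme characterization into concise topic labels.
    return llm(f"Give each theme a short topic label:\n{characterization}")

docs = ["Stock prices fell after the quarterly report.",
        "New vaccine trial results were announced."]
topics = define_topics(characterize_topics(summarize(docs)))
```

With a real model plugged into `llm`, each stage would return generated text rather than the echoed prompt tail shown by the stub.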

Keywords: Topic Modeling, Large Language Models, Natural Language Processing

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Aziz, S., Dowling, M., Hammami, H., and Piepenbrink, A. (2022). Machine learning in finance: A topic modeling approach. EFM, 28(3):744–770.

Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 30:31–40.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems.

Caliński, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1):1–27.

Dieng, A. B., Ruiz, F. J. R., and Blei, D. M. (2019). Topic modeling in embedding spaces. CoRR, abs/1907.04907.

El-Gayar, O., Al-Ramahi, M., Wahbeh, A., Nasralah, T., and Elnoshokaty, A. (2024). A comparative analysis of the interpretability of lda and llm for topic modeling: The case of healthcare apps. In AMCIS 2024 Proceedings.

Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., and Akinyelu, A. A. (2022). A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges. Engineering Applications of Artificial Intelligence.

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In SIGIR.

Júnior, A. P. D. S., Cecilio, P., Viegas, F., Cunha, W., Albergaria, E. T. D., and Rocha, L. C. D. D. (2022). Evaluating topic modeling pre-processing pipelines for portuguese texts. In Proceedings of the Brazilian Symposium on Multimedia and the Web, WebMedia ’22, page 191–201.

Júnior, A. P. D. S., Viegas, F., Gonçalves, M. A., and Rocha, L. (2023). Evaluating the limits of the current evaluation metrics for topic modeling. In Proceedings of the 29th Brazilian Symposium on Multimedia and the Web, WebMedia 2023, Ribeirão Preto, Brazil, October 23-27, 2023, pages 119–127. ACM.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C. A., Manning, C. D., Re, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Chatterji, N. S., Khattab, O., Henderson, P., Huang, Q., Chi, R. A., Xie, S. M., Santurkar, S., Ganguli, S., Hashimoto, T., Icard, T., Zhang, T., Chaudhary, V., Wang, W., Li, X., Mai, Y., Zhang, Y., and Koreeda, Y. (2023). Holistic evaluation of language models. TMLR.

Luiz, W., Viegas, F., Alencar, R., Mourão, F., Salles, T., Carvalho, D., Gonçalves, M. A., and Rocha, L. (2018). A feature-oriented sentiment rating for mobile app reviews. In Proceedings of the 2018 world wide web conference.

Ma, Z., Dou, Z., Xu, W., Zhang, X., Jiang, H., Cao, Z., and Wen, J.-R. (2021). Pre-training for ad-hoc retrieval: hyperlink is also you need. In CIKM.

MacAvaney, S., Nardini, F. M., Perego, R., Tonellotto, N., Goharian, N., and Frieder, O. (2020). Efficient document re-ranking for transformers by precomputing term representations. In SIGIR.

Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. (2018). Advances in pre-training distributed word representations. In LREC.

Mu, Y., Dong, C., Bontcheva, K., and Song, X. (2024). Large language models offer an alternative to the traditional approach of topic modelling. In LREC-COLING.

Ng, A. (2017). Machine learning yearning. URL: [link], 139.

Nikolenko, S., Koltsov, S., and Koltsova, O. (2015). Topic modelling for qualitative studies. Journal of Information Science, 43.

Pham, C. M., Hoyle, A., Sun, S., and Iyyer, M. (2023). TopicGPT: A prompt-based topic modeling framework. arXiv preprint arXiv:2311.01449.

Porturas, T. and Taylor, R. A. (2021). Forty years of emergency medicine research: Uncovering research themes and trends through topic modeling. The American Journal of Emergency Medicine, 45:213–220.

Rijcken, E., Scheepers, F., Zervanou, K., Spruit, M., Mosteiro, P., and Kaymak, U. (2023). Towards interpreting topic models with ChatGPT. In The 20th World Congress of the International Fuzzy Systems Association (IFSA).

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., Clark, K., Pfohl, S., Cole-Lewis, H., Neal, D., et al. (2023). Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). Llama: Open and efficient foundation language models.

Viegas, F., Canuto, S., Gomes, C., Luiz, W., Rosa, T., Ribas, S., Rocha, L., and Gonçalves, M. A. (2019). CluWords: exploiting semantic word clustering representation for enhanced topic modeling. In WSDM, pages 753–761.

Viegas, F., Cunha, W., Gomes, C., Pereira, A., Rocha, L., and Goncalves, M. (2020). CluHTM: semantic hierarchical topic modeling based on CluWords. In ACL, pages 8138–8150.

Viegas, F., Júnior, A. P. D. S., Cecilio, P., Tuler, E., Jr., W. M., Gonçalves, M. A., and Rocha, L. (2022). Semantic academic profiler (SAP): a framework for researcher assessment based on semantic topic modeling. Scientometrics, 127(8):5005–5026.

Viegas, F., Luiz, W., Gomes, C., Khatibi, A., Canuto, S., Mourão, F., Salles, T., Rocha, L., and Gonçalves, M. A. (2018). Semantically-enhanced topic modeling. In Proceedings of the 27th ACM international conference on information and knowledge management, pages 893–902.
Published
17/11/2024
CARVALHO, Daniel; PEREIRA, Antônio; TULER, Elisa; DIAS, Diego; CUNHA, Washington; ROCHA, Leonardo. Adapting Large Language Models for Topic Modeling Tasks. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 21. , 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024 . p. 954-965. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245035.
