Abstract
Public procurement plays a crucial role in government operations, which acquire goods and services through competitive bidding processes. However, the growing volume of procurement data has made manual analysis impractical and time-consuming. Text clustering and topic modeling techniques are therefore widely used to uncover hidden patterns in unstructured text data. This paper uses BERT-based models to address the challenges of analyzing public procurement data. Specifically, we employ BERTopic, a topic modeling technique based on BERT, to generate clusters that capture the underlying topics in procurement data. Additionally, we evaluate several sentence embedding models for representing procurement documents. By combining BERT-based models with sentence embeddings, we aim to improve the accuracy and interpretability of topic modeling in public procurement analysis. Our results provide insights into the underlying topics within the data, which can aid decision-making and improve the efficiency of procurement operations.
Notes
- 1.
- 2. (10, 13, 14, 15, 16, 17, 19, 20, auto).
- 3. (10, 20, 30, 40, 50, 60, 70, 80, 90, 100).
Acknowledgments.
This work was funded by the Prosecution Service of the State of Minas Gerais (Ministério Público do Estado de Minas Gerais) through the Analytical Capabilities Project (Programa de Capacidades Analíticas) and by CNPq, CAPES, and FAPEMIG.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hott, H.R., Silva, M.O., Oliveira, G.P., Brandão, M.A., Lacerda, A., Pappa, G. (2023). Evaluating Contextualized Embeddings for Topic Modeling in Public Bidding Domain. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_27
DOI: https://doi.org/10.1007/978-3-031-45392-2_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45391-5
Online ISBN: 978-3-031-45392-2
eBook Packages: Computer Science, Computer Science (R0)