Skip to main content

Evaluating Contextualized Embeddings for Topic Modeling in Public Bidding Domain

  • Conference paper
  • First Online:
Intelligent Systems (BRACIS 2023)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14197))

Included in the following conference series:

  • 254 Accesses

Abstract

Public procurement plays a crucial role in government operations by acquiring goods and services through competitive bidding processes. However, the increasing volume of procurement data has made manual analysis impractical and time-consuming. Therefore, text clustering and topic modeling techniques have been widely used to uncover hidden patterns in unstructured text data. This paper leverages the power of BERT-based models to overcome the challenges associated with analyzing public procurement data. Specifically, we employ BERTopic, a topic modeling technique based on BERT, to generate clusters that capture the underlying topics in procurement data. Additionally, we evaluate several sentence embedding models for representing procurement documents. By combining BERT-based models and advanced sentence embeddings, we aim to enhance the accuracy and interpretability of topic modeling in public procurement analysis. Our results provide valuable insights into the underlying topics within the data, aiding decision-making processes and improving the efficiency of procurement operations.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 59.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 79.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://huggingface.co/dccmpmgfinalisticas/LiBERT-SE.

  2. 2.

    (10, 13, 14, 15, 16, 17, 19, 20, auto).

  3. 3.

    (10, 20, 30, 40, 50, 60, 70, 80, 90, 100).

References

  1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

    Google Scholar 

  2. Bouma, G.: Normalized (pointwise) mutual information in collocation extraction. Proc. GSCL 30, 31–40 (2009)

    Google Scholar 

  3. Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 160–172. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_14

    Chapter  Google Scholar 

  4. Constantino, K., et al.: Segmentação e classificação semântica de trechos de diários oficiais usando aprendizado ativo. In: SBBD, pp. 304–316. SBC (2022). https://doi.org/10.5753/sbbd.2022.224656

  5. Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423

  6. Dieng, A.B., Ruiz, F.J.R., Blei, D.M.: Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguistics 8, 439–453 (2020). https://doi.org/10.1162/tacl_a_00325

    Article  Google Scholar 

  7. Feldman, R., Sanger, J.: The Text Mining Handbook - Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press (2007)

    Google Scholar 

  8. Feng, F., et al.: Language-agnostic BERT sentence embedding. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 878–891. Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2022.acl-long.62

  9. Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6894–6910. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.emnlp-main.552

  10. Grootendorst, M.: BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794 (2022)

  11. McInnes, L., et al.: UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3(29), 861 (2018). https://doi.org/10.21105/joss.00861

  12. Naseem, U., et al.: A comprehensive survey on word representation models: from classical to state-of-the-art word representation language models. ACM Trans. Asian Low Resour. Lang. Inf. Process. 20(5), 74:1–74:35 (2021). https://doi.org/10.1145/3434237

  13. Nikiforova, A., McBride, K.: Open government data portal usability: a user-centred usability analysis of 41 open government data portals. Telematics Inform. 58, 101539 (2021). https://doi.org/10.1016/j.tele.2020.101539

    Article  Google Scholar 

  14. Reimers, N., Gurevych, I.: Sentence-BERT: sentence Embeddings using Siamese BERT-Networks. In: EMNLP-IJCNLP, pp. 3980–3990. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1410

  15. Silva, M., et al.: LiPSet: um conjunto de dados com documentos rotulados de licitações públicas. In: Anais do IV Dataset Showcase Workshop, pp. 13–24. SBC, Porto Alegre, RS, Brasil (2022). https://doi.org/10.5753/dsw.2022.224925

  16. Silva, N.F.F., et al.: Evaluating topic models in Portuguese political comments about bills from Brazil’s chamber of deputies. In: Britto, A., Valdivia Delgado, K. (eds.) BRACIS 2021. LNCS (LNAI), vol. 13074, pp. 104–120. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-91699-2_8

    Chapter  Google Scholar 

  17. Silveira, R., et al.: Topic modelling of legal documents via legal-BERT. CEUR Workshop Proc. 1613, 0073 (2021)

    Google Scholar 

  18. Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28

    Chapter  Google Scholar 

  19. Souza Júnior, A.P., et al.: Evaluating topic modeling pre-processing pipelines for Portuguese texts. In: WebMedia, pp. 191–201. ACM (2022)

    Google Scholar 

  20. Turian, J.P., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 384–394. The Association for Computer Linguistics (2010)

    Google Scholar 

  21. Yang, Y., et al.: Multilingual universal sentence encoder for semantic retrieval. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL), pp. 87–94. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-demos.12

Download references

Acknowledgments.

This work was funded by the Prosecution Service of the State of Minas Gerais (Ministério Público do Estado de Minas Gerais) through the Analytical Capabilities Project (Programa de Capacidades Analíticas) and by CNPq, CAPES, and FAPEMIG.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Gisele Pappa .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Hott, H.R., Silva, M.O., Oliveira, G.P., Brandão, M.A., Lacerda, A., Pappa, G. (2023). Evaluating Contextualized Embeddings for Topic Modeling in Public Bidding Domain. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_27

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-45392-2_27

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45391-5

  • Online ISBN: 978-3-031-45392-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics