Segmentation and Summarization for Extracting Information about Information Technology Equipment from Government Procurement Notice

  • Erick Correia Silva (UFC)
  • Ivo Paixão de Medeiros (UFG)
  • Maria Viviane de Menezes (UFC)
  • Dayse Simon Landim Kamikawachi (UFG)

Abstract


Government procurement in Brazil uses a bidding process to acquire products and services. One stage of that process is the publication of public notices, which are structured documents setting out procurement rules and specifications. For Information Technology (IT) companies, competing in the bidding process means monitoring opportunities by analyzing the data in these notices. This paper applies text segmentation and summarization algorithms to extract data such as product names, prices and quantities from IT procurement notices. Four architectures are proposed: (i) sentence-based segmentation followed by K-means clustering; (ii) section-based segmentation followed by K-means clustering; (iii) sentence-based segmentation followed by BERTimbau clustering; and (iv) section-based segmentation followed by BERTimbau clustering. The Large Language Model (LLM) GPT-3.5 is then applied to every text clustered as an interest profile, to summarize and organize the information on product names, prices and quantities. Evaluation on real public notices from Federal and State Government Procurement sites shows that BERTimbau significantly outperformed K-means in both the sentence and the section segmentation tasks.

Keywords: Segmentation, Summarization, Government Procurement
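
As an illustration of architecture (i) from the abstract, the following is a minimal sketch of sentence-based segmentation followed by K-means clustering. It is not the authors' implementation: the regex sentence splitter, the bag-of-words term counts, the farthest-first centroid seeding, the sample notice text and all function names are assumptions made for this example only, and the abstract does not state which features or initialization the paper uses.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def vectorize(sentences):
    """Bag-of-words term counts over a shared vocabulary (assumed features)."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[t]) for t in vocab])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(vectors, k):
    """Deterministic farthest-first seeding (an assumption, chosen for reproducibility)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = init_centroids(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical notice excerpt, written for this example.
notice = (
    "The supplier shall deliver 10 notebook computers at a unit price of 4500 reais. "
    "The supplier shall deliver 5 laser printers at a unit price of 1200 reais. "
    "Bids must be submitted through the electronic portal by the deadline. "
    "Bids must be signed by the legal representative of the company."
)

sentences = split_sentences(notice)
labels = kmeans(vectorize(sentences), k=2)

for label, sentence in zip(labels, sentences):
    print(label, sentence)
```

In this toy input the two item sentences fall into one cluster and the two procedural sentences into the other. In the pipeline the abstract describes, only the cluster matching the interest profile would then be passed to the LLM for summarization.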

Published
17/11/2024
SILVA, Erick Correia; MEDEIROS, Ivo Paixão de; MENEZES, Maria Viviane de; KAMIKAWACHI, Dayse Simon Landim. Segmentation and Summarization for Extracting Information about Information Technology Equipment from Government Procurement Notice. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 12., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 145-152. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2024.244753.