PromptNER: An Approach to Named Entity Recognition in Sensitive Data from Automatically Labeled Instances

Claudio M. V. de Andrade; Celso França; Fabiano Belém; Gabriel Jallais; Marcelo A. S. Ganem; Gabriel Texeira; Alberto H. F. Laender; Marcos A. Gonçalves

doi:10.5753/sbbd.2023.232532

Claudio M. V. de Andrade, Universidade Federal de Minas Gerais
Celso França, Universidade Federal de Minas Gerais
Fabiano Belém, Universidade Federal de Minas Gerais
Gabriel Jallais, Universidade Federal de Minas Gerais
Marcelo A. S. Ganem, Universidade Federal de Minas Gerais
Gabriel Texeira, Universidade Federal de Minas Gerais
Alberto H. F. Laender, Universidade Federal de Minas Gerais
Marcos A. Gonçalves, Universidade Federal de Minas Gerais

Abstract

In this paper, we address the task of Named Entity Recognition (NER) for Organizations and Products/Services mentioned in textual complaints registered on Web platforms. The strong inference capabilities of large language models (LLMs) have driven growing interest in applying them to this task, but they bring high infrastructure costs and, when accessed through external APIs, privacy problems. We therefore propose an approach that uses LLMs to recognize the entities in the complaints and then trains simpler models, such as the SpERT method, on the resulting automatically labeled instances. The improved NER model achieves significant F-score gains of 41% to 129% over the model trained only on manually labeled data.

Keywords: Entity Recognition, Generative Model, Transformers
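
The abstract describes a two-step idea: an LLM labels entities in the complaints, and those labels become training data for a simpler span-based model. As a rough illustration only, the sketch below shows how entity mentions returned by an LLM might be converted into token-offset annotations. It is not the authors' implementation: the function names, the label names (`ORGANIZATION`, `PRODUCT_SERVICE`), the whitespace tokenization, and the output schema (tokens plus typed spans with an exclusive end offset) are all assumptions made for this example, and the LLM call itself is left out.

```python
from typing import Dict, List, Tuple


def find_spans(tokens: List[str], mention: str) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets, end exclusive, of each match of `mention`."""
    target = mention.lower().split()
    n = len(target)
    if n == 0:
        return []
    lowered = [t.lower() for t in tokens]
    return [
        (i, i + n)
        for i in range(len(lowered) - n + 1)
        if lowered[i:i + n] == target
    ]


def to_training_instance(text: str, llm_entities: List[Dict[str, str]]) -> Dict:
    """Turn LLM-extracted mentions into a span-annotated training instance.

    `llm_entities` is a hypothetical, already-parsed LLM output:
    a list of {"text": ..., "type": ...} dictionaries.
    Mentions that cannot be located in the text are silently dropped.
    """
    tokens = text.split()  # naive whitespace tokenization (assumption)
    entities = []
    for ent in llm_entities:
        for start, end in find_spans(tokens, ent["text"]):
            entities.append({"type": ent["type"], "start": start, "end": end})
    return {"tokens": tokens, "entities": entities}


# Invented complaint text and invented LLM output, for illustration only.
example = to_training_instance(
    "The Acme Telecom broadband plan stopped working",
    [
        {"text": "Acme Telecom", "type": "ORGANIZATION"},
        {"text": "broadband plan", "type": "PRODUCT_SERVICE"},
    ],
)
```

Dropping mentions that do not match the text is one simple way to guard against an LLM returning strings that never appear in the complaint; a real pipeline would also need a tokenizer that handles punctuation.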
