PromptNER: An Approach for Named Entity Recognition in Sensitive Data from Automatically Labeled Instances

  • Claudio M. V. de Andrade, Federal University of Minas Gerais
  • Celso França, Federal University of Minas Gerais
  • Fabiano Belém, Federal University of Minas Gerais
  • Gabriel Jallais, Federal University of Minas Gerais
  • Marcelo A. S. Ganem, Federal University of Minas Gerais
  • Gabriel Texeira, Federal University of Minas Gerais
  • Alberto H. F. Laender, Federal University of Minas Gerais
  • Marcos A. Gonçalves, Federal University of Minas Gerais

Abstract


In this article, we address the task of Named Entity Recognition (NER) for Organizations and Products/Services in textual complaints recorded on web platforms. The strong inference capabilities of Large Language Models (LLMs) have driven growing interest in applying them to this task. However, LLMs incur high infrastructure costs and raise privacy concerns when sensitive data is sent to external APIs. We therefore propose an approach that uses LLMs to recognize entities in complaints and then uses the automatically labeled instances to train simpler models, such as SpERT. The enhanced NER model achieves significant F-score gains of 41% to 129% over a model trained only on the labeled data.
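
The labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the LLM returns a list of (mention, type) pairs for each complaint, and it converts them into token-level spans in a JSON layout modeled on the one used by the public SpERT code (tokens plus entities with an exclusive end index). The tokenizer, the helper names, the example complaint, and the example LLM output are all hypothetical.

```python
import json
import re
from typing import Dict, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    # Simple word/punctuation tokenizer (an assumption made for illustration).
    return re.findall(r"\w+|[^\w\s]", text, flags=re.UNICODE)


def find_span(tokens: List[str], mention: List[str]) -> Optional[Tuple[int, int]]:
    # Return (start, end) of the first case-insensitive match, end exclusive.
    n, m = len(tokens), len(mention)
    if m == 0:
        return None
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in mention]
    for i in range(n - m + 1):
        if lowered[i:i + m] == target:
            return i, i + m
    return None


def to_training_instance(text: str, llm_entities: List[Tuple[str, str]]) -> Dict:
    # Turn LLM-provided (mention, type) pairs into a span-annotated instance.
    tokens = tokenize(text)
    entities = []
    for mention, label in llm_entities:
        span = find_span(tokens, tokenize(mention))
        if span is None:
            # The mention does not occur verbatim in the text, so skip it
            # rather than train on a label that cannot be located.
            continue
        entities.append({"type": label, "start": span[0], "end": span[1]})
    return {"tokens": tokens, "entities": entities, "relations": []}


# Hypothetical complaint and hypothetical LLM output.
complaint = "I bought a Galaxy phone from ACME Store and it arrived broken."
llm_output = [
    ("ACME Store", "Organization"),
    ("Galaxy phone", "Product/Service"),
    ("Foo Bank", "Organization"),  # not in the text, will be dropped
]

instance = to_training_instance(complaint, llm_output)
print(json.dumps(instance, indent=2))
```

Dropping mentions that cannot be located verbatim is one simple way to limit label noise in the automatically generated training set; other filtering strategies are equally plausible.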

Keywords: Entity Recognition, Generative Model, Transformers

Published: 2023-09-25

ANDRADE, Claudio M. V. de; FRANÇA, Celso; BELÉM, Fabiano; JALLAIS, Gabriel; GANEM, Marcelo A. S.; TEXEIRA, Gabriel; LAENDER, Alberto H. F.; GONÇALVES, Marcos A. PromptNER: An Approach for Named Entity Recognition in Sensitive Data from Automatically Labeled Instances. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 269-281. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.232532.