PetroGeoNER: A Refined and Unified Dataset for NER in the Oil & Gas Domain
Abstract
Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that identifies relevant entities (i.e., word n-grams) in a text and assigns them to predefined semantic categories. Annotated datasets are crucial for developing NER models and assessing their quality. This is a challenge for underrepresented languages and specialized domains, where such resources are scarce. Furthermore, the word-level annotation required by NER datasets is laborious and prone to inconsistencies. To contribute more resources for Portuguese, this paper presents PetroGeoNER, a NER dataset for the Oil & Gas domain. Creating the dataset involved unifying, revising, and resolving inconsistencies in two existing datasets. PetroGeoNER was used to train accurate NER models. Both the models and the dataset have been made publicly available.
