Named Entity Extraction in Industrial Diagrams with Computer Vision and MLLMs
Abstract
This paper presents a comprehensive evaluation of methods for extracting named entities (tags) from Piping and Instrumentation Diagrams (P&IDs), critical documents used throughout the lifecycle of industrial facilities. Efficient indexing and retrieval of information from such technical documentation play a key role in improving industrial project management. The paper proposes integrating deep neural networks for object detection with state-of-the-art Multimodal Large Language Models (MLLMs), specifically GPT-4o and LLaMA-3.2-Vision-90B. The methodology includes a preprocessing stage that isolates and downscales symbols before tag extraction. Experimental results on a challenging dataset of 30 real-world symbols show that MLLM-based approaches significantly outperform traditional OCR techniques (82%), achieving near-perfect accuracy in most cases. The approach remains robust under noisy and degraded conditions, offering a practical solution for augmenting engineering document databases with reliable named entities. These findings highlight the potential of MLLMs for high-precision tag text extraction, thereby streamlining the indexing and management of complex engineering documentation. All 30 test symbols used in this study are provided as supplementary material to ensure reproducibility.
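The abstract reports accuracy over a set of test symbols but does not state the scoring rule. As a minimal sketch, assuming a tag counts as correct only when it matches the expected tag exactly after case and whitespace normalization, the evaluation could look like the following. The function names, the normalization, and the sample tags are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def normalize_tag(tag: str) -> str:
    """Uppercase a tag and drop all whitespace (an assumed normalization)."""
    return "".join(tag.split()).upper()


def exact_match_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of symbols whose predicted tag equals the expected tag."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot score an empty test set")
    hits = sum(
        normalize_tag(p) == normalize_tag(e) for p, e in zip(predicted, expected)
    )
    return hits / len(expected)


# Hypothetical tags for illustration only; not taken from the paper's data.
expected_tags = ["FIC-101", "PT-2045", "LV-3001", "TE-0412"]
noisy_output = ["FIC-1O1", "PT-2045", "LV-3001", "TE-0412"]  # letter O instead of zero
clean_output = ["fic-101", "PT-2045", "LV-3001", "TE-0412"]  # differs only in case

print(f"noisy: {exact_match_accuracy(noisy_output, expected_tags):.2f}")
print(f"clean: {exact_match_accuracy(clean_output, expected_tags):.2f}")
```

Exact matching is a deliberately strict choice: a single confused character, such as the letter O read in place of a zero, makes the whole tag count as wrong, which matters when tags are used as index keys.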
Published
30/09/2025
How to Cite
WEIBULL, Jon Karl; SANTOS, Filip Duarte dos; SOUZA, Jefferson Alves de; LIMA, Denyson Tomaz de; LEMOS, Melissa; CASANOVA, Marco Antonio. Named Entity Extraction in Industrial Diagrams with Computer Vision and MLLMs. In: WORKSHOP DE APLICAÇÕES INDUSTRIAIS - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 303-308.
