RPAs and Data Lakes for Industry 4.0: A Case Study of Integrated Data Ecosystem
Abstract
Robotic Process Automation (RPA) has accelerated process automation in corporate environments, but RPA-based solutions face performance limitations when handling large volumes of data, which leaves room to improve their scalability and efficiency. This paper proposes a distributed, modular, and loosely coupled Data Lake architecture for collecting, storing, and processing heterogeneous legacy data. The solution is organized into functional layers and builds on open-source tools such as Hadoop, Spark, and Airflow. A case study implemented with production line data at a multinational electronics company demonstrates the feasibility and benefits of the proposed approach.
