Methodologies and tools for data governance applied to data lake management: a systematic review
Abstract
A data lake is a centralized repository designed to store diverse types of data, regardless of format or structure. While this flexibility offers significant advantages, it also risks turning the repository into a data swamp, a scenario in which disorganized, inconsistent, or low-value data accumulates. Mitigating this risk requires robust data governance practices that keep the data properly organized and efficiently managed. Given the existing knowledge gaps concerning effective governance implementation in data lakes, this study presents a systematic literature review aimed at identifying the methodologies and tools currently employed in managing such repositories.
References
Bližnák, K., Munk, M., and Pilková, A. (2024). A systematic review of recent literature on data governance (2017–2023). IEEE Access, 12:149875–149888.
Cherradi, M., Bouhafer, F., and El Haddadi, A. (2023). Data lake governance using IBM-Watson Knowledge Catalog. Scientific African, 21.
Cherradi, M. and El Haddadi, A. (2024). Enhancing data lake management systems with LDA approach. Journal of Data Science and Intelligent Systems, 3(1):58–66.
DAMA International (2017). DAMA-DMBOK: Data Management Body of Knowledge. Technics Publications, USA, 2nd edition.
Derakhshannia, M., Gervet, C., Hajj-Hassan, H., Laurent, A., and Martin, A. (2020). Data lake governance: Towards a systemic and natural ecosystem analogy. Future Internet, 12:1–16.
Derakhshannia, M., Laurent, A., and Martin, A. (2023). Mixing biology and computer science concepts to design resilient data lakes. Journal of Interdisciplinary Methodologies and Issues in Science, 11.
Galvão, M. C. B. and Ricarte, I. L. M. (2020). Revisão sistemática da literatura: conceituação, produção e publicação. LOGEION: Filosofia da informação, 6:57–63.
Garriga, M., Aarns, K., Tsigkanos, C., Tamburri, D. A., and Heuvel, W. V. D. (2021). DataOps for cyber-physical systems governance: The airport passenger flow case. ACM Transactions on Internet Technology, 21(2):Article 36, 25 pages.
Giebler, C., Gröger, C., Hoos, E., Schwarz, H., and Mitschang, B. (2020). A zone reference model for enterprise-grade data lake management. In 2020 IEEE 24th International Enterprise Distributed Object Computing Conference (EDOC), pages 57–66, Eindhoven, Netherlands.
Gyulgyulyan, E. and Astsatryan, H. (2023). Alert system for data quality in data lakes. In CSIT Conference 2023, Yerevan, Armenia.
Hamadou, H. B., Bach Pedersen, T., and Thomsen, C. (2020). The Danish national energy data lake: Requirements, technical architecture, and tool selection. In 2020 IEEE International Conference on Big Data (Big Data), pages 1523–1532, Atlanta, GA, USA.
Ishwarappa and Anuradha, J. (2015). A brief introduction on big data 5Vs characteristics and Hadoop technology. Procedia Computer Science, 48:319–324.
Nambiar, A. and Mundra, D. (2022). An overview of data warehouse and data lake in modern enterprise data management. Big Data and Cognitive Computing, 6(4):132.
O’Brien, M. A., Mohally, D., Brasche, G. P., and Sanfilippo, A. G. (2022). Huawei and international data spaces. In Otto, B., ten Hompel, M., and Wrobel, S., editors, Designing Data Spaces. Springer, Cham.
Plebani, P., Kat, R., Pallas, F., Werner, S., Inches, G., Laud, P., and Santiago, R. (2023). TEADAL: Trustworthy, energy-aware federated data lakes along the computing continuum. In CEUR Workshop Proceedings, volume 3413, pages 28–35.
Sarramia, D., Claude, A., Ogereau, F., Mezhoud, J., and Mailhot, G. (2022). CEBA: A data lake for data sharing and environmental monitoring. Sensors, 22:2733.
Sosa, D. and Paciello, J. (2021). Data lake: A case of study of a big data analytics architecture for public procurements. In 2021 Eighth International Conference on eDemocracy & eGovernment (ICEDEG), pages 194–198, Quito, Ecuador.
Wang, H., Adenutsi, C. D., Wang, C., Sun, Z., Zhang, Y., Li, Y., Zhang, Z., and Wang, J. (2023). Construction and application of a big data system for regional lakes in coalbed methane development. ACS Omega, 8(20):18323–18331.
Published
2025-08-12
How to Cite
SANTOS, Wyllyany C.; LIMA, David H. S.; SILVA, Carlos A. F.; FERRO, Márcio R. C. Methodologies and tools for data governance applied to data lake management: a systematic review. In: REGIONAL SCHOOL ON COMPUTING OF BAHIA, ALAGOAS, AND SERGIPE (ERBASE), 25., 2025, Lagarto/SE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 336-344. DOI: https://doi.org/10.5753/erbase.2025.13809.
