ARANI: An Experiment Line-Based Approach for Privacy Preservation in Data Lakes

Abstract


Data Lakes store large volumes of heterogeneous data, including sensitive information. Complying with regulations such as Brazil's General Data Protection Law (LGPD) requires anonymization techniques. However, a single technique applied in isolation, such as k-Anonymity or Differential Privacy, may be insufficient, so combining techniques in configurable flows is essential. Experiment Lines allow these flows to be structured and instantiated flexibly. This paper proposes ARANI, an Experiment Line-based approach for defining, executing, and evaluating anonymization flows with support for multiple techniques.
Keywords: Security, Privacy, Anonymization, Data Lake, Experiment Line
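
To make the idea of a configurable anonymization flow concrete, the following is a minimal sketch and not ARANI's actual interface. All function names (`generalize_age`, `suppress_small_groups`, `laplace_count`, `run_flow`) and the toy records are assumptions introduced for illustration. It chains a generalization step with a k-anonymity-style suppression step, then releases a count through the Laplace mechanism used in Differential Privacy.

```python
import random
from collections import Counter
from functools import partial


def generalize_age(rows, width=10):
    """Replace each exact age with a range of `width` years (e.g. 23 -> '20-29')."""
    out = []
    for r in rows:
        low = (r["age"] // width) * width
        out.append({**r, "age": f"{low}-{low + width - 1}"})
    return out


def suppress_small_groups(rows, quasi_ids, k):
    """Drop rows whose quasi-identifier combination appears fewer than k times."""
    key = lambda r: tuple(r[q] for q in quasi_ids)
    counts = Counter(key(r) for r in rows)
    return [r for r in rows if counts[key(r)] >= k]


def laplace_count(rows, epsilon, rng):
    """Row count plus Laplace noise of scale 1/epsilon (a count has sensitivity 1)."""
    # The difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return len(rows) + noise


def run_flow(rows, steps):
    """Apply each configured step in order; a flow is just an ordered list of steps."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical toy records; 'age' and 'zip' act as quasi-identifiers.
records = [
    {"age": 23, "zip": "20000", "disease": "flu"},
    {"age": 27, "zip": "20000", "disease": "cold"},
    {"age": 31, "zip": "20000", "disease": "flu"},
    {"age": 38, "zip": "20000", "disease": "asthma"},
    {"age": 45, "zip": "30000", "disease": "flu"},
]

# One possible instantiation of the flow: generalize, then enforce k = 2.
flow = [
    partial(generalize_age, width=10),
    partial(suppress_small_groups, quasi_ids=("age", "zip"), k=2),
]
anonymized = run_flow(records, flow)
noisy_total = laplace_count(anonymized, epsilon=1.0, rng=random.Random(0))
```

Swapping, reordering, or re-parameterizing the entries of `flow` yields a different instantiation of the same structure, which is the kind of variability the abstract attributes to Experiment Lines.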

Published
2025-09-29
JORDÃO, Thiago; BEDO, Marcos; DE OLIVEIRA, Daniel. ARANI: An Experiment Line-Based Approach for Privacy Preservation in Data Lakes. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 844-850. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247756.