SDDup: Confidentiality-Aware Semantic Deduplication
Abstract
Full-document encryption in existing document management systems creates major storage bottlenecks and hampers fine-grained access control. To overcome this, we present SDDup, a novel model that employs semantic-aware segmentation to separate common data segments from unique, sensitive ones. This allows for efficient deduplication and targeted encryption while maintaining integrity and regulatory compliance. We validate SDDup through theoretical and empirical analysis, security evaluation, and large-scale experiments on Brazilian birth certificates and university degrees. We compare SDDup against EDRStore, a leading system for generic encrypted data reduction. Our findings highlight SDDup as a competitive and scalable solution for document management.
References
Bellare, M., Keelveedhi, S., and Ristenpart, T. (2013). Message-locked encryption and secure deduplication. In Annual international conference on the theory and applications of cryptographic techniques, pages 296–312. Springer.
Biega, A. J. and Finck, M. (2021). Reviving purpose limitation and data minimisation in data-driven systems. arXiv preprint arXiv:2101.06203.
Brasil (2018). Lei nº 13.709, de 14 de agosto de 2018. Lei Geral de Proteção de Dados Pessoais (LGPD). Diário Oficial da União, 157(1):59–64.
Brasil (2020). Instrução Normativa nº 1, de 15 de dezembro de 2020.
Chakaravarthy, V. T., Gupta, H., Roy, P., and Mohania, M. K. (2008). Efficient techniques for document sanitization. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 843–852.
European Union (2018). General Data Protection Regulation, Regulation (EU) 2016/679.
Friedlin, F. J. and McDonald, C. J. (2008). A Software Tool for Removing Patient Identifying Information from Clinical Documents. Journal of the American Medical Informatics Association, 15(5):601–610.
Iwendi, C., Moqurrab, S. A., Anjum, A., Khan, S., Mohan, S., and Srivastava, G. (2020). N-sanitization: A semantic privacy-preserving framework for unstructured medical datasets. Computer Communications, 161:160–171.
Kardaş, S. and Kiraz, M. S. (2016). Solving the secure storage dilemma: An efficient scheme for secure deduplication with privacy-preserving public auditing. Cryptology ePrint Archive.
Keelveedhi, S., Bellare, M., and Ristenpart, T. (2013). DupLESS: Server-aided encryption for deduplicated storage. In 22nd USENIX Security Symposium (USENIX Security 13), pages 179–194.
Li, J., Qin, C., Lee, P. P., and Li, J. (2016). Rekeying for encrypted deduplication storage. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 618–629. IEEE.
Lin, J. C.-W., Srivastava, G., Zhang, Y., Djenouri, Y., and Aloqaily, M. (2021). Privacy-preserving multiobjective sanitization model in 6G IoT environments. IEEE Internet of Things Journal, 8(7):5340–5349.
Liu, X., Yang, G., Susilo, W., Tonien, J., Chen, R., and Lv, X. (2020). Message-locked searchable encryption: A new versatile tool for secure cloud storage. IEEE Transactions on Services Computing, 15(3):1664–1677.
Miranda, M., Esteves, T., Portela, B., and Paulo, J. (2021). S2Dedup: SGX-enabled secure deduplication. In Proceedings of the 14th ACM International Conference on Systems and Storage, pages 1–12.
Solovyev, A. V. (2023). The problem of defining the concept of “electronic document for long-term storage”. In Silhavy, R., Silhavy, P., and Prokopova, Z., editors, Data Science and Algorithms in Systems, pages 326–333, Cham. Springer International Publishing.
Stanešić, J., Morić, Z., Regvart, D., and Bencarić, I. (2025). Digital signatures and their legal significance. Edelweiss applied science and technology, 9(1):403–412.
Wu, Z., Xuan, S., Xie, J., Lin, C., and Lu, C. (2022). How to ensure the confidentiality of electronic medical records on the cloud: A technical perspective. Computers in biology and medicine, 147:105726.
Yan, Z., Jiang, H., Tan, Y., and Luo, H. (2016). Deduplicating compressed contents in cloud storage environment. In 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 16).
Zhao, J., Yang, Z., Li, J., and Lee, P. P. (2024). Encrypted data reduction: Removing redundancy from encrypted data in outsourced storage. ACM Transactions on Storage, 20(4):1–30.
Published
1 September 2025
How to Cite
MAYR, Lucas; SILVANO, Wellington Fernandes; HOLSTEIN, Gabriel; CUSTÓDIO, Ricardo. SDDup: Confidentiality-Aware Semantic Deduplication. In: SIMPÓSIO BRASILEIRO DE CIBERSEGURANÇA (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 806-821. DOI: https://doi.org/10.5753/sbseg.2025.9797.
