Privacy Risks in Health Data: Investigating Inference of Sensitive Attributes of Citizens in DATASUS
Abstract
The statistical dissemination of health data is crucial for formulating and monitoring public policies and for scientific research, but it poses significant challenges to the privacy of data subjects. In this work, we formally and experimentally evaluate the risk of inferring sensitive attributes in the DATASUS outpatient procedure dataset, which contains microdata on millions of citizens from 1994 to the present day. We identify serious privacy risks: for example, in some cases it is possible to infer sensitive attributes with an accuracy higher than 90% for almost 30% of the records in the database. These results raise the question of whether the platform complies with the Lei Geral de Proteção de Dados (LGPD), Brazil's general data protection law.
