Privacy Risks Associated with the Use of LLMs in Software Development
Abstract
Research Context: Large Language Models (LLMs), e.g., GPT-2 and GPT-4, are increasingly embedded in software development to assist with code generation, testing, and documentation. Their adoption promises productivity gains but also raises challenges for Information Systems (IS), particularly around data privacy and regulatory compliance.
Scientific and/or Practical Problem: Developers may inadvertently include personal or sensitive information in prompts. LLMs can temporarily retain and reproduce this data in later outputs, causing privacy breaches. This creates technical, ethical, and legal concerns under the Brazilian LGPD and the European GDPR. While recent studies survey developer perceptions, empirical evidence of violations in practice remains scarce.
Proposed Solution and/or Analysis: We present an experiment simulating a Membership Inference Attack (MIA) on a local GPT-2 instance. Using synthetic personal data with realistic identifiers, we evaluate how the model can recall sensitive information and propose mitigation strategies aligned with regulatory and organizational frameworks.
Related IS Theory: The study is grounded in Socio-Technical Systems Theory and IS models such as the People-Process-Technology (PPT) framework, emphasizing the alignment of human, organizational, and technological dimensions. It also connects to the GranDSI-BR 2016–2026 research challenges on privacy, ethics, and security in intelligent IS.
Research Method: We adopted a mixed-methods approach, combining an experimental MIA simulation on GPT-2 with a literature review and regulatory analysis. Synthetic Brazilian personal data ensured realistic and ethically compliant experimentation.
Summary of Results: The experiment showed that GPT-2 can reproduce sensitive identifiers from prior prompts, even without fine-tuning or persistent memory.
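The membership-inference idea underlying the experiment can be illustrated with a simplified loss-threshold variant of the attack: samples the model fits unusually well (low loss) are flagged as likely training members. This is a minimal, self-contained sketch with entirely synthetic, illustrative loss values and threshold, not the paper's actual protocol or data:

```python
def loss_threshold_mia(losses, threshold):
    # Flag a sample as a likely training-set member when the model's
    # loss on it falls below the threshold: members are typically
    # fitted better, hence show lower loss than unseen samples.
    return [loss < threshold for loss in losses]

# Synthetic per-sample losses (illustrative values only):
member_losses = [0.8, 1.1, 0.9]      # samples the model was exposed to
nonmember_losses = [2.7, 3.1, 2.9]   # samples it never saw

predictions = loss_threshold_mia(member_losses + nonmember_losses,
                                 threshold=2.0)
print(predictions)  # [True, True, True, False, False, False]
```

In practice the threshold would be calibrated, e.g., via shadow models as in Shokri et al. (2017), rather than chosen by hand as here.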
Contributions and Impact to the IS Area: The study provides reproducible experimental evidence of privacy risks in LLMs and reinforces the urgency of embedding privacy-by-design principles into IS development workflows.
References
Birru, H., Cicchetti, A., and Latifaj, M. (2025). Supporting automated documentation updates in continuous software development with large language models. In Mannion, M., Männistö, T., and Maciaszek, L. A., editors, Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering, ENASE 2025, Porto, Portugal, April 4-6, 2025, pages 92–106. SCITEPRESS.
Choquet, G. et al. (2024). Exploiting privacy vulnerabilities in open-source LLMs: Prompt-based leakage in LLaMA. In Proceedings of the IEEE Symposium on Security and Privacy.
de Cerqueira, J. A. S., Azevedo, A. P. D., Leão, H. A. T., and Canedo, E. D. (2022). Guide for artificial intelligence ethical requirements elicitation - RE4AI ethical guide. In 55th Hawaii International Conference on System Sciences, HICSS 2022, Virtual Event / Maui, Hawaii, USA, January 4-7, 2022, pages 1–10. ScholarSpace.
Falcão, F. D. S. and Canedo, E. D. (2024). Investigating software development teams members’ perceptions of data privacy in the use of large language models (LLMs). In Machado, I., Maldonado, J. C., Conte, T., Canedo, E. D., Marques, J., de França, B. B. N., Matsubara, P., Viana, D., Soares, S., Santos, G., Rocha, L., Gadelha, B., dos Santos, R. P., Oran, A. C., and Neto, A. G. S. S., editors, Proceedings of the XXIII Brazilian Symposium on Software Quality, SBQS 2024, Salvador, Bahia, Brazil, November 5-8, 2024, pages 373–382. ACM.
Ferrão, S. É. R., Silva, G. R. S., Canedo, E. D., and Mendes, F. F. (2024). Towards a taxonomy of privacy requirements based on the LGPD and ISO/IEC 29100. Inf. Softw. Technol., 168:107396.
GDPR (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). [link]. Official Journal of the European Union, L 119, 4 May 2016.
Gonçalves, C. D., de Paoli Menescal, E., de Mendonça, F. L. L., and Canedo, E. D. (2024). Trust in AI: perspectives of c-level executives in brazilian organizations. In Machado, I., Maldonado, J. C., Conte, T., Canedo, E. D., Marques, J., de França, B. B. N., Matsubara, P., Viana, D., Soares, S., Santos, G., Rocha, L., Gadelha, B., dos Santos, R. P., Oran, A. C., and Neto, A. G. S. S., editors, Proceedings of the XXIII Brazilian Symposium on Software Quality, SBQS 2024, Salvador, Bahia, Brazil, November 5-8, 2024, pages 147–157. ACM.
Kim, D. and Hua, M. (2025). Assessing output reliability and similarity of large language models in software development: A comparative case study approach. Inf. Softw. Technol., 185:107787.
Kim, S. et al. (2024). ProPILE: Probing privacy leaks from intermediate layer embeddings. arXiv preprint arXiv:2402.00888.
LGPD (2018). Lei Geral de Proteção de Dados Pessoais (LGPD), Lei nº 13.709, de 14 de agosto de 2018. [link].
Liu, Y., He, H., Han, T., Zhang, X., et al. (2024). Understanding LLMs: A comprehensive overview from training to inference. arXiv preprint arXiv:2401.02038.
Martin, M., Coutinho, D., Uchôa, A., and Pereira, J. (2025). Evaluating the potential of large language models in security-related software requirements classification. In Anais do XXXIX Simpósio Brasileiro de Engenharia de Software, pages 315–325, Porto Alegre, RS, Brasil. SBC.
Mauran, C. (2023). Samsung bans ChatGPT, AI chatbots after data leak blunder. Mashable. Accessed: 2024-04-18.
Menegazzi, D. and Canedo, E. D. (2026). Research artefact: Privacy risks associated with the use of llms in software development. [link]. Zenodo. DOI: 10.5281/zenodo.17298895.
Nam, J. and Kim, H. (2024). Understanding code with large language models: Challenges and opportunities. Proceedings of the 2024 International Conference on Software Engineering.
Rocha, L. D. and Canedo, E. D. (2025). Optimizing compliance: Comparative study of data laws and privacy frameworks. Journal of Internet Services and Applications, 16(1):431–452.
Rocha, L. D., Silva, G. R. S., and Canedo, E. D. (2023). Privacy compliance in software development: A guide to implementing the LGPD principles. In Hong, J., Lanperne, M., Park, J. W., Cerný, T., and Shahriar, H., editors, Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, SAC 2023, Tallinn, Estonia, March 27-31, 2023, pages 1352–1361. ACM.
Shokri, R., Stronati, M., Song, C., and Shmatikov, V. (2017). Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE.
Singhal, J. et al. (2024). Preventing sensitive output generation in Llama-2 via instructional fine-tuning. arXiv preprint arXiv:2406.00240.
Yan, B., Li, K., Xu, M., et al. (2024). On protecting the data privacy of llms: A survey. arXiv preprint arXiv:2403.05156.
Yao, A. and Zhang, B. (2023). A survey on LLM privacy risks. Journal of AI Research.
Published
25/05/2026
How to Cite
MENEGAZZI, Diego; CANEDO, Edna Dias. Privacy Risks Associated with the Use of LLMs in Software Development. In: SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 22., 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 201-215. DOI: https://doi.org/10.5753/sbsi.2026.248323.
