Can Language Models Generate Secure Terraform Code? A Security-Focused Benchmark Using Static Analysis

Abstract

The widespread adoption of Infrastructure-as-Code (IaC) has made cloud misconfiguration a critical security concern, while Large Language Models (LLMs) and Small Language Models (SLMs) are reshaping how code is written. We present an empirical benchmark evaluating whether LLMs and SLMs can generate security-compliant AWS Terraform configurations. Our automated pipeline integrates Checkov and Trivy into a GitLab CI/CD workflow across four Amazon S3 scenarios, evaluating six models under three prompt strategies of increasing security specificity. Security compliance improves consistently with prompt detail, though no model achieves full compliance in any configuration. Our findings suggest that prompt design is a critical factor and highlight the need for a dedicated pipeline for developing and validating LLM-assisted secure IaC generation. All artifacts are publicly available.
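
As a rough illustration of how per-strategy security compliance could be aggregated from static-analysis results, the sketch below computes a pass rate from scanner summaries. It is not the authors' pipeline: the `passed` and `failed` field names are assumed to mirror the summary section of a scanner's JSON report, the strategy labels are hypothetical, and the counts are placeholders rather than results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def compliance_rate(summary: Dict[str, int]) -> float:
    """Fraction of evaluated checks that passed (skipped checks are ignored)."""
    passed = summary.get("passed", 0)
    failed = summary.get("failed", 0)
    total = passed + failed
    return passed / total if total else 0.0


def mean_rate_by_strategy(
    runs: Iterable[Tuple[str, Dict[str, int]]]
) -> Dict[str, float]:
    """Average the compliance rate of all runs that share a prompt strategy."""
    rates = defaultdict(list)
    for strategy, summary in runs:
        rates[strategy].append(compliance_rate(summary))
    return {name: sum(vals) / len(vals) for name, vals in rates.items()}


if __name__ == "__main__":
    # Placeholder counts for illustration only; strategy names are hypothetical.
    example_runs = [
        ("minimal", {"passed": 1, "failed": 1}),
        ("minimal", {"passed": 3, "failed": 1}),
        ("explicit", {"passed": 4, "failed": 0}),
    ]
    for name, rate in mean_rate_by_strategy(example_runs).items():
        print(f"{name}: {rate:.3f}")
```

Averaging per-run rates, rather than pooling all checks, weights each generated configuration equally regardless of how many checks apply to it; either choice is defensible and should be stated when reporting.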

Published
19/07/2026
VARGAS, Francis Luis Santos; MANSILHA, Rodrigo Brandão; KREUTZ, Diego. Can Language Models Generate Secure Terraform Code? A Security-Focused Benchmark Using Static Analysis. In: SIMPÓSIO DE INFRAESTRUTURA DIGITAL/NUVEM PARA PESQUISA (PESQUISA@NUVEM), 1., 2026, Gramado/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 29-38. DOI: https://doi.org/10.5753/pesquisanuvem.2026.23954.