Improving Source Code Security: A Novel Approach Using Spectrum of Prompts and Automated State Machine
Abstract
As software security becomes increasingly vital, automating the remediation of source code vulnerabilities is essential for improving system reliability. This research presents an integrated framework that combines large language models (LLMs) with a patch-compile-test state machine (PCT-SM) to generate accurate, functional code repairs with minimal human intervention. The solution is organized into three stages: an Editing Plan, which identifies the necessary code edits; a Patch Plan, which generates unified patches; and a Verification Plan, which rigorously validates repairs through the PCT-SM. The process is further refined by the Spectrum of Prompts (SoP) technique, which iteratively explores prompt variations to improve remediation effectiveness. Experimental evaluations indicate that our approach achieves higher remediation success rates and more robust testing performance than conventional methods, with the SoP component yielding particularly strong repair outcomes.
References
Ahmad, B., Thakur, S., Tan, B., Karri, R., and Pearce, H. (2024). On hardware security bug code fixes by prompting large language models. IEEE Transactions on Information Forensics and Security.
Britton, T., Jeng, L., Carver, G., Cheak, P., and Katzenellenbogen, T. (2013). Reversible debugging software. Judge Business School, University of Cambridge, Cambridge, UK, Tech. Rep. 229.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Bui, Q.-C., Scandariato, R., and Ferreyra, N. E. D. (2022). Vul4j: A dataset of reproducible java vulnerabilities geared towards the study of program repair techniques. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), pages 464–468.
Chen, B., Zhang, Z., Langrené, N., and Zhu, S. (2023). Unleashing the potential of prompt engineering in large language models: a comprehensive review. arXiv preprint arXiv:2310.14735.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Fan, Z., Gao, X., Mirchev, M., Roychoudhury, A., and Tan, S. H. (2023). Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1469–1481. IEEE.
Gu, Z., Barr, E. T., Hamilton, D. J., and Su, Z. (2010). Has the bug really been fixed? In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, pages 55–64.
Kulsum, U., Zhu, H., Xu, B., and d’Amorim, M. (2024). A case study of llm for automated vulnerability repair: Assessing impact of reasoning and patch validation feedback. In Proceedings of the 1st ACM International Conference on AI-Powered Software, pages 103–111.
Lawler, E. L. and Wood, D. E. (1966). Branch-and-bound methods: A survey. Operations research, 14(4):699–719.
Le, T. K., Alimadadi, S., and Ko, S. Y. (2024). A study of vulnerability repair in javascript programs with large language models. In Companion Proceedings of the ACM on Web Conference 2024, pages 666–669.
Lelis, C. A. S. (2025). SoP experiments results data. [link]. Accessed: 2025-07-29.
Liu, P., Wang, H., Zheng, C., and Zhang, Y. (2024). Prompt fix: Vulnerability automatic repair technology based on prompt engineering. In 2024 International Conference on Computing, Networking and Communications (ICNC), pages 116–120. IEEE.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
Morrison, D. R., Jacobson, S. H., Sauppe, J. J., and Sewell, E. C. (2016). Branch-and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discrete Optimization, 19:79–102.
Nong, Y., Aldeen, M., Cheng, L., Hu, H., Chen, F., and Cai, H. (2024). Chain-of-thought prompting of large language models for discovering and fixing software vulnerabilities. arXiv preprint arXiv:2402.17230.
Pearce, H., Tan, B., Ahmad, B., Karri, R., and Dolan-Gavitt, B. (2023). Examining zero-shot vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy (SP), pages 2339–2356. IEEE.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
Weimer, W., Nguyen, T., Le Goues, C., and Forrest, S. (2009). Automatically finding patches using genetic programming. In 2009 IEEE 31st International Conference on Software Engineering, pages 364–374. IEEE.
Wu, Y., Jiang, N., Pham, H. V., Lutellier, T., Davis, J., Tan, L., Babkin, P., and Shah, S. (2023). How effective are neural networks for fixing security vulnerabilities. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1282–1294.
Xia, C. S. and Zhang, L. (2023). Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt. arXiv preprint arXiv:2304.00385.
Zhang, L., Zou, Q., Singhal, A., Sun, X., and Liu, P. (2024a). Evaluating large language models for real-world vulnerability repair in c/c++ code. In Proceedings of the 10th ACM International Workshop on Security and Privacy Analytics, pages 49–58.
Zhang, Q., Fang, C., Xie, Y., Ma, Y., Sun, W., Yang, Y., and Chen, Z. (2024b). A systematic literature review on large language models for automated program repair. arXiv preprint arXiv:2405.01466.
Zhang, Q., Zhang, T., Zhai, J., Fang, C., Yu, B., Sun, W., and Chen, Z. (2023a). A critical review of large language model on software engineering: An example from chatgpt and automated program repair. arXiv preprint arXiv:2310.08879.
Zhang, Z., Chen, C., Liu, B., Liao, C., Gong, Z., Yu, H., Li, J., and Wang, R. (2023b). A survey on language models for code. arXiv preprint arXiv:2311.07989.
Published
2025-09-01
How to Cite
LELIS, Claudio A. S.; MARCONDES, Cesar A. C.; FEALEY, Kevin. Improving Source Code Security: A Novel Approach Using Spectrum of Prompts and Automated State Machine. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 513-529. DOI: https://doi.org/10.5753/sbseg.2025.9822.
