Guardrail: Uma Abordagem Modular para Sistemas de Segurança em Inteligência Artificial Generativa
Resumo
Guardrail é um framework open-source para proteger modelos de linguagem (LLMs) contra ataques como prompt injection e jailbreaks. Adota uma arquitetura modular inspirada em mixture of experts, combinando filtros leves e semânticos. Avaliado em 1000 prompts, obteve F1-score de 0,9844 (modo paralelo) e 0,9711 (sequencial), superando PromptGuard e LLM-Guard. Os resultados demonstram a eficácia da combinação de módulos especializados na segurança de LLMs.Referências
Bick, A., Blandin, A., and Deming, D. J. (2024). The Rapid Adoption of Generative AI. Technical report, Federal Reserve Bank of St. Louis.
Das, B. C., Amini, M. H., and Wu, Y. (2025). Security and privacy challenges of large language models: A survey. ACM Computing Surveys, 57(6):1–39.
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., and Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90.
Liu, Y., Jia, Y., Geng, R., Jia, J., and Gong, N. Z. (2024). Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847.
McKinsey & Company (2024). 65% das empresas usam Gen AI no mundo. [link]. Accessed: 2025-04-10.
Piet, J., Alrashed, M., Sitawarin, C., Chen, S., Wei, Z., Sun, E., Alomair, B., and Wagner, D. (2024). Jatmo: Prompt injection defense by task-specific finetuning. In European Symposium on Research in Computer Security, pages 105–124. Springer.
Rebedea, T., Dinu, R., Sreedhar, M., Parisien, C., and Cohen, J. (2023). NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. arXiv preprint arXiv:2310.10501.
Sechel, S. (2024). Adversarial testing and prompt injection attacks using the embedding similarity approximation method.
Sharma, R. K., Gupta, V., and Grossman, D. (2024). SPML: A DSL for Defending Language Models Against Prompt Attacks.
Suo, X. et al. (2024). Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications. arXiv preprint.
The Alan Turing Institute (2024). Indirect Prompt Injection: Generative AI’s Greatest Security Flaw. [link]. Accessed: 2025-04-10.
TI Inside (2024). IA generativa se tornará uma indústria de US$ 100 bilhões até 2026. [link]. Accessed: 2025-04-10.
Zhang, X., Zhang, C., Li, T., Huang, Y., Jia, X., Hu, M., Zhang, J., Liu, Y., Ma, S., and Shen, C. (2025). JailGuard: A Universal Detection Framework for Prompt-based Attacks on LLM Systems. ACM Transactions on Software Engineering and Methodology.
Zhou, Y. et al. (2023). Prompt Injection Attack against LLM-integrated Applications. arXiv preprint arXiv:2306.05499. [link].
Das, B. C., Amini, M. H., and Wu, Y. (2025). Security and privacy challenges of large language models: A survey. ACM Computing Surveys, 57(6):1–39.
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., and Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90.
Liu, Y., Jia, Y., Geng, R., Jia, J., and Gong, N. Z. (2024). Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847.
McKinsey & Company (2024). 65% das empresas usam Gen AI no mundo. [link]. Accessed: 2025-04-10.
Piet, J., Alrashed, M., Sitawarin, C., Chen, S., Wei, Z., Sun, E., Alomair, B., and Wagner, D. (2024). Jatmo: Prompt injection defense by task-specific finetuning. In European Symposium on Research in Computer Security, pages 105–124. Springer.
Rebedea, T., Dinu, R., Sreedhar, M., Parisien, C., and Cohen, J. (2023). NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. arXiv preprint arXiv:2310.10501.
Sechel, S. (2024). Adversarial testing and prompt injection attacks using the embedding similarity approximation method.
Sharma, R. K., Gupta, V., and Grossman, D. (2024). SPML: A DSL for Defending Language Models Against Prompt Attacks.
Suo, X. et al. (2024). Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications. arXiv preprint.
The Alan Turing Institute (2024). Indirect Prompt Injection: Generative AI’s Greatest Security Flaw. [link]. Accessed: 2025-04-10.
TI Inside (2024). IA generativa se tornará uma indústria de US$ 100 bilhões até 2026. [link]. Accessed: 2025-04-10.
Zhang, X., Zhang, C., Li, T., Huang, Y., Jia, X., Hu, M., Zhang, J., Liu, Y., Ma, S., and Shen, C. (2025). JailGuard: A Universal Detection Framework for Prompt-based Attacks on LLM Systems. ACM Transactions on Software Engineering and Methodology.
Zhou, Y. et al. (2023). Prompt Injection Attack against LLM-integrated Applications. arXiv preprint arXiv:2306.05499. [link].
Publicado
01/09/2025
Como Citar
BOCAMPAGNI, Fábio Alves; MENASCHÉ, Daniel Sadoc; KOGEYAMA, Renato.
Guardrail: Uma Abordagem Modular para Sistemas de Segurança em Inteligência Artificial Generativa. In: SIMPÓSIO BRASILEIRO DE CIBERSEGURANÇA (SBSEG), 25. , 2025, Foz do Iguaçu/PR.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2025
.
p. 1074-1081.
DOI: https://doi.org/10.5753/sbseg.2025.11506.
