Guardrail: Uma Abordagem Modular para Sistemas de Segurança em Inteligência Artificial Generativa

Fábio Alves Bocampagni; Daniel Sadoc Menasché; Renato Kogeyama

doi:10.5753/sbseg.2025.11506

Fábio Alves Bocampagni UFRJ
Daniel Sadoc Menasché UFRJ
Renato Kogeyama Kearney

DOI: https://doi.org/10.5753/sbseg.2025.11506

Resumo

Guardrail é um framework open-source para proteger modelos de linguagem (LLMs) contra ataques como prompt injection e jailbreaks. Adota uma arquitetura modular inspirada em mixture of experts, combinando filtros leves e semânticos. Avaliado em 1000 prompts, obteve F1-score de 0,9844 (modo paralelo) e 0,9711 (sequencial), superando PromptGuard e LLM-Guard. Os resultados demonstram a eficácia da combinação de módulos especializados na segurança de LLMs.

Referências

Bick, A., Blandin, A., and Deming, D. J. (2024). The Rapid Adoption of Generative AI. Technical report, Federal Reserve Bank of St. Louis.

Das, B. C., Amini, M. H., and Wu, Y. (2025). Security and privacy challenges of large language models: A survey. ACM Computing Surveys, 57(6):1–39.

Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., and Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90.

Liu, Y., Jia, Y., Geng, R., Jia, J., and Gong, N. Z. (2024). Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847.

McKinsey & Company (2024). 65% das empresas usam Gen AI no mundo. [link]. Accessed: 2025-04-10.

Piet, J., Alrashed, M., Sitawarin, C., Chen, S., Wei, Z., Sun, E., Alomair, B., and Wagner, D. (2024). Jatmo: Prompt injection defense by task-specific finetuning. In European Symposium on Research in Computer Security, pages 105–124. Springer.

Rebedea, T., Dinu, R., Sreedhar, M., Parisien, C., and Cohen, J. (2023). NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. arXiv preprint arXiv:2310.10501.

Sechel, S. (2024). Adversarial testing and prompt injection attacks using the embedding similarity approximation method.

Sharma, R. K., Gupta, V., and Grossman, D. (2024). SPML: A DSL for Defending Language Models Against Prompt Attacks.

Suo, X. et al. (2024). Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications. arXiv preprint.

The Alan Turing Institute (2024). Indirect Prompt Injection: Generative AI’s Greatest Security Flaw. [link]. Accessed: 2025-04-10.

TI Inside (2024). IA generativa se tornará uma indústria de US$ 100 bilhões até 2026. [link]. Accessed: 2025-04-10.

Zhang, X., Zhang, C., Li, T., Huang, Y., Jia, X., Hu, M., Zhang, J., Liu, Y., Ma, S., and Shen, C. (2025). JailGuard: A Universal Detection Framework for Prompt-based Attacks on LLM Systems. ACM Transactions on Software Engineering and Methodology.

Zhou, Y. et al. (2023). Prompt Injection Attack against LLM-integrated Applications. arXiv preprint arXiv:2306.05499. [link].

Guardrail: Uma Abordagem Modular para Sistemas de Segurança em Inteligência Artificial Generativa

Resumo

Referências

Artigos mais lidos do(s) mesmo(s) autor(es)