Evaluation of NeMo Guardrails as a Firewall for User-LLM Interaction
Abstract
This article analyzes NeMo Guardrails, a guardrail framework built on Large Language Models (LLMs) whose role is to act as a firewall during user-LLM interactions. The objective is to evaluate its performance in detecting malicious attempts in this context, such as jailbreaking and prompt injection. The Do-Not-Answer dataset was used for the task of classifying prompts into the Safe or Unsafe classes. The evaluation comprised a risk-category analysis, the computation of binary classification metrics, and the Compensation Rate, a new metric proposed in this study. The F1 score indicates a possible trade-off between Precision and Sensitivity.
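The evaluation described above relies on standard binary classification metrics over Safe/Unsafe verdicts. The sketch below is not the authors' code; it only illustrates how Precision, Sensitivity (Recall), and F1 would typically be computed in that setting, with the Unsafe class taken as the positive label. The Compensation Rate proposed in the paper is defined in the full text and is not reproduced here; function and variable names are illustrative assumptions.

```python
from typing import Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "Unsafe") -> dict:
    """Compute Precision, Sensitivity (Recall), and F1, treating `positive`
    (here, the Unsafe class) as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"precision": precision, "sensitivity": sensitivity, "f1": f1}


# Hypothetical example: ground-truth labels vs. the guardrail's Safe/Unsafe verdicts.
truth = ["Unsafe", "Unsafe", "Safe", "Safe", "Unsafe"]
preds = ["Unsafe", "Safe", "Safe", "Unsafe", "Unsafe"]
print(binary_metrics(truth, preds))
```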
References

Alzaabi, F. R. and Mehmood, A. (2024). A review of recent advances, challenges, and opportunities in malicious insider threat detection using machine learning methods. IEEE Access, 12:30907–30927.
Deng, G., Liu, Y., Li, Y., Wang, K., Zhang, Y., Li, Z., Wang, H., Zhang, T., and Liu, Y. (2024). Masterkey: Automated jailbreaking of large language model chatbots. In Proceedings 2024 Network and Distributed System Security Symposium, NDSS 2024. Internet Society.
Derczynski, L., Galinkin, E., Martin, J., Majumdar, S., and Inie, N. (2024). garak: A framework for security probing large language models.
Esmradi, A., Yip, D. W., and Chan, C. F. (2023). A comprehensive survey of attack techniques, implementation, and mitigation strategies in large language models.
Feng, Y., Chen, Z., Kang, Z., Wang, S., Zhu, M., Zhang, W., and Chen, W. (2024). Jailbreaklens: Visual analysis of jailbreak attacks against large language models.
Ghosh, S., Varshney, P., Sreedhar, M. N., Padmakumar, A., Rebedea, T., Varghese, J. R., and Parisien, C. (2025). Aegis2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails.
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., and Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection.
Gupta, P., Yau, L. Q., Low, H. H., Lee, I.-S., Lim, H. M., Teoh, Y. X., Koh, J. H., Liew, D. W., Bhardwaj, R., Bhardwaj, R., and Poria, S. (2024). Walledeval: A comprehensive safety evaluation toolkit for large language models.
Jiang, H., Li, S., and Wang, M. (2024). Controlnet: An advanced firewall for retrieval-augmented generation systems.
Wang, Y., Li, H., Han, X., Nakov, P., and Baldwin, T. (2023). Do-not-answer: A dataset for evaluating safeguards in LLMs. arXiv preprint arXiv:2308.13387.
Xu, X., Li, M., Tao, C., Shen, T., Cheng, R., Li, J., Xu, C., Tao, D., and Zhou, T. (2024). A survey on knowledge distillation of large language models.
Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., and Huang, M. (2024). SafetyBench: Evaluating the safety of large language models.
Published
2025-09-01
How to Cite
ALVES, Marcos Guilherme D. de Oliveira; CASTRO, João Victor F. de; LIMA, Jean Phelipe de Oliveira; AZAMBUJA, Antônio João Gonçalves de; OLIVEIRA, Leonardo Barbosa; SOARES, Anderson da Silva. Evaluation of NeMo Guardrails as a Firewall for User-LLM Interaction. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 226-238. DOI: https://doi.org/10.5753/sbseg.2025.10665.
