Operationalizing the Traffic Light Protocol for Native Portuguese Language Model Guardrails in Collaborative Workflows

Eduardo Alexandre de Amorim; Cleber Zanchettin

doi:10.5753/sbsc.2026.20118

Eduardo Alexandre de Amorim, Universidade de São Paulo (USP) / Universidade Federal de Pernambuco (UFPE)
Cleber Zanchettin, Universidade Federal de Pernambuco (UFPE)

Abstract

The integration of Language Models (LMs) into collaborative environments, such as corporate chat platforms (e.g., Slack, Teams), shared document editors, and internal copilots, introduces distinct security risks. The primary threat shifts from direct user attacks to two others: the manipulation of LMs into generating harmful content, and the injection of adversarial inputs that exploit social engineering within trusted teams. Although models that detect attempts to elicit harmful content exist, operational frameworks for deploying them without disrupting the User Experience (UX) are lacking. This gap is particularly critical for Portuguese, which lacks native safety resources of sufficient quality to distinguish cultural nuances from actual threats. A system that blocks legitimate technical discussions (false positives) or introduces noticeable latency disrupts collaboration and pushes users toward "Shadow IT" solutions. This work proposes a reference architecture for operationalizing LM moderation in Portuguese contexts, using SecBERT as a pre-filtering layer. We define a "Traffic Light" workflow for message routing, tunable decision thresholds, and distinct auditing roles to manage ambiguity in human communication. We then conduct a feasibility analysis using a trace-driven simulation based on a dataset of 29,432 interactions adapted from widely adopted safety taxonomies. Results show that the proposed architecture processes requests with a P99 latency of 18.54 ms. We also analyze the operational friction caused by complex "role-playing" prompts common in technical teams and propose a mitigation strategy that reduces human intervention to 1.9% of interactions. These findings indicate that it is technically feasible to operationalize native, discriminative models to secure collaborative streams with minimal friction.
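The "Traffic Light" routing with tunable thresholds described above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the threshold values, the names `Light`, `Thresholds`, `route`, and `human_review_rate`, and the reading of green as "pass through", yellow as "hold for a human auditor", and red as "block". The risk score is assumed to be a probability in [0, 1] produced by the pre-filtering classifier.

```python
from dataclasses import dataclass
from enum import Enum


class Light(Enum):
    GREEN = "green"    # assumed meaning: forward the message to the LM
    YELLOW = "yellow"  # assumed meaning: ambiguous, hold for a human auditor
    RED = "red"        # assumed meaning: block before it reaches the LM


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical defaults; the abstract only states that thresholds are tunable.
    allow_below: float = 0.30
    block_at_or_above: float = 0.85

    def __post_init__(self):
        # Reject configurations where the bands would overlap or leave [0, 1].
        if not 0.0 <= self.allow_below <= self.block_at_or_above <= 1.0:
            raise ValueError("expected 0 <= allow_below <= block_at_or_above <= 1")


def route(risk_score: float, t: Thresholds = Thresholds()) -> Light:
    """Map a classifier risk score in [0, 1] to a traffic-light decision."""
    if risk_score >= t.block_at_or_above:
        return Light.RED
    if risk_score < t.allow_below:
        return Light.GREEN
    return Light.YELLOW


def human_review_rate(scores, t: Thresholds = Thresholds()) -> float:
    """Fraction of messages routed to YELLOW, i.e. needing human intervention."""
    lights = [route(s, t) for s in scores]
    return lights.count(Light.YELLOW) / len(lights) if lights else 0.0
```

Under these assumptions, widening the gap between the two thresholds sends more traffic to human auditors, while narrowing it trades review load for a higher risk of wrong automatic decisions. A call such as `human_review_rate(trace_scores)` over a replayed trace is one way such a fraction could be estimated.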

Keywords: Collaborative Systems Security, LM Guardrails, Portuguese NLP Security, Secure Architecture, Feasibility Analysis
