Negotiating LLMs for Enhanced Hate Speech Classification and Interpretability
Abstract
Traditional hate speech classification frameworks rely on a single Large Language Model (LLM) that makes its decision in an isolated, single-turn process. This approach struggles with nuanced linguistic phenomena such as sarcasm, ambiguity, and contextual shifts. To address these challenges, we introduce a multi-LLM negotiation framework in which two specialized models engage in iterative exchanges to refine the classification decision. The generator proposes a classification together with a rationale, while the discriminator evaluates the credibility of that proposal and requests adjustments until a consensus is reached. Experiments on a Portuguese-language Twitter hate speech detection dataset show that the negotiation-based approach, using Sabizinho 3 and GPT 4.1 Nano in a zero-shot setting, achieves competitive precision and recall. The results further indicate that the framework makes the final classification interpretable.
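The generator and discriminator exchange described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Proposal`, `Verdict`, `negotiate`, `toy_generator`, and `toy_discriminator` are hypothetical, the two models are replaced by plain Python callables instead of real LLM calls, and the `max_rounds` cap is an assumption, since the abstract does not say what happens when no consensus is reached.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Proposal:
    label: str      # e.g. "hate" or "not_hate" (label names are assumed)
    rationale: str  # the generator's explanation for the label


@dataclass
class Verdict:
    agrees: bool    # True when the discriminator accepts the proposal
    feedback: str   # requested adjustment when it does not


# Hypothetical interfaces: in practice each callable would wrap an LLM prompt.
Generator = Callable[[str, Optional[str]], Proposal]
Discriminator = Callable[[str, Proposal], Verdict]


def negotiate(
    text: str,
    generator: Generator,
    discriminator: Discriminator,
    max_rounds: int = 3,  # assumed cap; not specified in the source
) -> Tuple[Proposal, bool, List[Tuple[Proposal, Verdict]]]:
    """Iterate propose/evaluate turns until consensus or the round cap."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    history: List[Tuple[Proposal, Verdict]] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        proposal = generator(text, feedback)
        verdict = discriminator(text, proposal)
        history.append((proposal, verdict))
        if verdict.agrees:
            return proposal, True, history
        feedback = verdict.feedback  # passed to the generator next round
    # No consensus: return the last proposal and flag the disagreement.
    return history[-1][0], False, history


# Toy stand-ins so the sketch runs without any model access.
def toy_generator(text: str, feedback: Optional[str]) -> Proposal:
    if feedback is None:
        return Proposal("not_hate", "No explicit slur detected.")
    return Proposal("hate", "Revised after feedback: " + feedback)


def toy_discriminator(text: str, proposal: Proposal) -> Verdict:
    if proposal.label == "hate":
        return Verdict(True, "")
    return Verdict(False, "The sarcasm targets a protected group.")


if __name__ == "__main__":
    final, consensus, history = negotiate(
        "example post", toy_generator, toy_discriminator
    )
    for turn, (p, v) in enumerate(history, start=1):
        print(turn, p.label, "| agrees:", v.agrees)
    print("final:", final.label, "| consensus:", consensus)
```

The returned `history` keeps every proposal, rationale, and piece of feedback, which is one plausible way the negotiation trace could support the interpretability the abstract mentions.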
