404: Civility Not Found? Evaluating the Effectiveness of Small Language Models in Detecting Incivility in GitHub Conversations

Abstract


Context: Incivility on open-source software (OSS) platforms like GitHub can harm collaboration, discourage contributor participation, and impact code quality. Although current moderation tools based on Machine Learning (ML) and Natural Language Processing (NLP) offer some support, they often struggle to detect nuanced or implicit types of incivility. Goal: This study aims to assess the effectiveness of Small Language Models (SLMs) in detecting both coarse-grained (civil vs. uncivil) and fine-grained (specific types of) incivility in GitHub conversations (issues and pull requests), and to understand how different prompting strategies influence detection performance. Method: We evaluate ten SLMs (3B-14B parameters) across five prompting strategies on a labeled dataset of more than 6k GitHub conversations. We also compare the best-performing SLMs with five traditional ML models using two text-encoding techniques. Results: Our results reveal that SLMs perform well in detecting civil comments, but their effectiveness in detecting uncivil comments depends on model size. Models with 9B+ parameters (e.g., deepseek-r1, gpt-4o-mini) show improved performance on uncivil comments. At the fine-grained level, the prompting strategy plays a critical role: role-based prompting achieves the best results, particularly for implicit incivility types (e.g., Irony and Mocking), even though SLMs generally struggle with these types. Traditional ML models remain competitive for explicit types such as Threat and Insulting. Conclusion: Our findings highlight the effectiveness of SLMs and prompting strategies in enhancing the detection of incivility within collaborative software development settings.
Keywords: small language models, incivility, moderation, GitHub conversations, open-source projects
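To illustrate the role-based prompting strategy that the abstract reports as most effective, here is a minimal sketch in Python. It only builds the chat messages for a generic chat-completion-style API; the moderator persona wording and the label list are hypothetical illustrations, not the paper's actual prompts, and the fine-grained labels beyond those named in the abstract (Irony, Mocking, Threat, Insulting) are placeholders.

```python
# Hypothetical sketch of role-based prompting for incivility classification.
# The persona text and label set are illustrative, not the paper's exact prompts.

# Fine-grained labels: the first four appear in the abstract; "civil" marks
# the coarse-grained negative class.
INCIVILITY_LABELS = ["Irony", "Mocking", "Threat", "Insulting", "civil"]


def build_role_based_prompt(comment: str) -> list[dict]:
    """Return chat messages that assign the model a moderator persona
    before asking it to classify a GitHub comment."""
    system = (
        "You are an experienced open-source community moderator reviewing "
        "GitHub issue and pull request comments. First decide whether the "
        "comment is civil or uncivil; if uncivil, choose the closest type "
        "from: " + ", ".join(INCIVILITY_LABELS[:-1]) + "."
    )
    user = (
        "Comment:\n\"\"\"" + comment + "\"\"\"\n"
        "Answer with: civil or uncivil, and the type if uncivil."
    )
    return [
        {"role": "system", "content": system},  # persona carries the "role-based" part
        {"role": "user", "content": user},
    ]


messages = build_role_based_prompt(
    "Did you even read the docs before opening this issue?"
)
print(messages[0]["role"])  # the system message holds the moderator persona
```

The design point is that the persona is placed in the system message, so the same user-message template can be reused unchanged across the zero-shot and other prompting strategies being compared.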

Published
22/09/2025
PATRÍCIO, Mário; EUFRÁSIO, Silas; UCHÔA, Anderson; ROCHA, Lincoln S.; COUTINHO, Daniel; PEREIRA, Juliana Alves; PAIXÃO, Matheus; GARCIA, Alessandro. 404: Civility Not Found? Evaluating the Effectiveness of Small Language Models in Detecting Incivility in GitHub Conversations. In: SIMPÓSIO BRASILEIRO DE ENGENHARIA DE SOFTWARE (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 304-314. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.9933.