DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates

  • Klaywert Danillo Ferreira de Souza UFCG
  • David Eduardo Pereira UFCG
  • Cláudio E. C. Campelo UFCG
  • Larissa Lucena Vasconcelos IFPB

Abstract


Debating is essential in daily life, whether in academic or professional settings, casual conversations, political forums, or online discussions. Because debates serve such a broad range of purposes, their structures and formats can vary significantly, and developing corpora that account for these variations is challenging. Debate corpora remain scarce in the current state of the art, particularly for languages other than English. To address this gap, this research proposes the DEBISS corpus, a collection of spoken and individual debates in Portuguese with semi-structured features. The corpus has broad applicability across Natural Language Processing tasks, including speech-to-text, speaker diarization, argument mining, and debate quality evaluation.

Published
2025-09-29
SOUZA, Klaywert Danillo Ferreira de; PEREIRA, David Eduardo; CAMPELO, Cláudio E. C.; VASCONCELOS, Larissa Lucena. DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 580-587. DOI: https://doi.org/10.5753/stil.2025.37860.