CordelSextilha.BR: A Benchmark for Poetic Form in Brazilian Cordel Verse Generation
Abstract
We introduce CordelSextilha.BR, the first benchmark for automatic generation of Brazilian cordel sextilhas. We compile 1,519 public-domain stanzas, remove the final line, and gather expert ratings for rhyme, meter, coherence, and “cordelisticity”. Rule-based metrics (RhythmAcc, RhymeAcc) align with human judgements; GPT-4o and LLaMA-3.2 (8B) reach about 80% rhyme and about 75% rhythm accuracy. The corpus, metrics and baselines enable RLHF research in low-resource poetry.
Published
2025-09-29
How to Cite
BARBOSA, Bryan K. S.; BARBOSA, Marcela Y. A. CordelSextilha.BR: A Benchmark for Poetic Form in Brazilian Cordel Verse Generation. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 736-747. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14067.
