Topic-Based Text Segmentation in Portuguese Using BERTimbau

Abstract

In this work, we explore topic-based text segmentation for Portuguese using the BERTimbau model, with datasets derived from machine translation and online news sources. The model achieves Pk = 6.89 in an in-domain evaluation but performs worse in out-of-domain evaluations, which highlights the importance of a diverse training set for generalization across domains.
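
For readers unfamiliar with the Pk metric reported above, the following is a minimal sketch of how Pk is commonly computed. It is not the authors' evaluation code. The function name `pk`, the per-sentence segment-id input format, and the default window size (half the average reference segment length) are assumptions made for illustration. Lower values indicate better segmentation, and the reported figure of 6.89 is presumably this probability expressed as a percentage.

```python
from typing import Optional, Sequence


def pk(reference: Sequence[int], hypothesis: Sequence[int],
       k: Optional[int] = None) -> float:
    """Illustrative Pk: fraction of sliding-window probes on which the
    reference and the hypothesis disagree about whether two sentences
    k positions apart belong to the same segment.

    Each input holds one segment id per sentence (hypothetical format).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    n = len(reference)
    if k is None:
        # Assumed convention: half the average reference segment length.
        num_segments = 1 + sum(a != b for a, b in zip(reference, reference[1:]))
        k = max(1, round(n / (2 * num_segments)))
    if n <= k:
        raise ValueError("document is too short for window size k")

    errors = 0
    for i in range(n - k):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_in_ref != same_in_hyp  # count disagreements
    return errors / (n - k)


# Toy document: 12 sentences in three topics of four sentences each.
reference = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
perfect = list(reference)
no_boundaries = [0] * 12

pk_perfect = pk(reference, perfect)
pk_no_boundaries = pk(reference, no_boundaries)
print(f"perfect: {100 * pk_perfect:.2f}  no boundaries: {100 * pk_no_boundaries:.2f}")
```

In this toy case a hypothesis identical to the reference scores 0, while a hypothesis that predicts no boundaries at all is penalized on every probe that straddles a true boundary.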

Keywords: text segmentation, natural language processing, Portuguese datasets, BERTimbau

Published
2024-11-17
DA SILVA, Luciano A. C.; RODRIGUES, Maiara S. F.; ARCHANJO, Adriana P.; PESSOA, Luis; SILVA, Miguel L.; DE ALMEIDA, Thiago F.; SILVEIRA, Leonardo. Topic-Based Text Segmentation in Portuguese Using BERTimbau. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 32-36. DOI: https://doi.org/10.5753/stil.2024.245080.