Topic-Based Text Segmentation in Portuguese Using BERTimbau

Luciano A. C. da Silva; Maiara S. F. Rodrigues; Adriana P. Archanjo; Luis Pessoa; Miguel L. Silva; Thiago F. de Almeida; Leonardo Silveira

doi:10.5753/stil.2024.245080

Luciano A. C. da Silva USP http://orcid.org/0009-0002-2061-9903
Maiara S. F. Rodrigues CPQD https://orcid.org/0009-0006-6138-8258
Adriana P. Archanjo CPQD https://orcid.org/0000-0001-9503-194X
Luis Pessoa CPQD https://orcid.org/0009-0000-6290-4476
Miguel L. Silva CPQD https://orcid.org/0009-0002-9411-4465
Thiago F. de Almeida CPQD https://orcid.org/0009-0004-4528-9351
Leonardo Silveira PUC-Campinas https://orcid.org/0000-0002-4468-6812

Abstract

In this work, we explore text segmentation for Portuguese using the BERTimbau model, with datasets built through machine translation and from online news. We obtained P_k = 6.89 in an in-domain evaluation but worse results in out-of-domain evaluations, which highlights the importance of a diverse training set for improving generalization across multiple domains.
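For readers unfamiliar with the P_k metric, the following is a minimal sketch of its standard definition (the probability that two positions k units apart are inconsistently classified as belonging to the same or to different segments). It is not the authors' evaluation code: the function name `pk`, the representation of a segmentation as a list of segment lengths, and the default choice of k as half the mean reference segment length are assumptions made for illustration. The function returns a fraction in [0, 1]; scores such as 6.89 are presumably that fraction expressed as a percentage, which is also an assumption.

```python
def pk(reference, hypothesis, k=None):
    """Illustrative P_k: fraction of windows of width k on which the
    reference and hypothesis disagree about 'same segment or not'.

    Both segmentations are lists of segment lengths (hypothetical format),
    e.g. [3, 3] means two segments of three sentences each.
    """
    def to_labels(segment_lengths):
        # Expand segment lengths into one segment id per sentence.
        labels = []
        for segment_id, length in enumerate(segment_lengths):
            labels.extend([segment_id] * length)
        return labels

    ref = to_labels(reference)
    hyp = to_labels(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of units")

    n = len(ref)
    if k is None:
        # Assumed convention: half the mean reference segment length.
        k = max(1, round(n / (2 * len(reference))))
    if k >= n:
        raise ValueError("window size k must be smaller than the text length")

    errors = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / (n - k)


if __name__ == "__main__":
    # A hypothesis that misses the only boundary, scored with k = 2.
    print(pk([3, 3], [6], k=2))
```

Lower values are better: a hypothesis identical to the reference scores 0, and every window on which the two segmentations disagree raises the score.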

Keywords: text segmentation, natural language processing, Portuguese datasets, BERTimbau
