A Study About Ruling Summarization Evaluation

Abstract


Various evaluation metrics for text generation have been proposed in recent years, but it remains unclear how well they capture the accuracy and quality of the generated text. In this work, we study how some of the most popular text generation metrics behave on the text summarization task in the Portuguese legal domain. Specifically, we evaluated five metrics (ROUGE, BERTScore, BARTScore, BLEURT, and MoverScore) on a dataset containing 892 rulings from the Brazilian Superior Court of Justice. Each item in the dataset consists of a ruling, which is the original legal document, and a syllabus, which is a manually written summary of that document. Our study revealed that, for the Brazilian legal domain, none of the evaluated metrics was capable of fully measuring the quality of manually written summaries when compared with their original documents, and that, among the evaluated metrics, ROUGE and BERTScore presented the most promising results.
Keywords: Text Summarization, Evaluation Metrics, Legal Domain
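
To make the evaluation setup concrete, the following is a minimal sketch of how a lexical-overlap score such as ROUGE-1 could be computed between a syllabus and its ruling. It is not the authors' implementation: the function names `tokenize` and `rouge_1`, the simplified tokenization (lowercasing, no stemming), and the two example strings are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenization (assumption): lowercase and keep word
    # characters only. Real ROUGE toolkits apply their own rules.
    return re.findall(r"\w+", text.lower())


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical dataset item; both strings are invented for illustration.
ruling = "the court denied the appeal because the deadline had expired"
syllabus = "appeal denied because the deadline expired"

# The syllabus (summary) is scored against the ruling (original document).
precision, recall, f1 = rouge_1(syllabus, ruling)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because a syllabus is much shorter than its ruling, recall against the full ruling tends to be low even for a faithful summary. This is one reason a purely lexical score may not fully reflect summary quality.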

Published
2023-09-25
FELTRIN, Gustavo Rufino; VIANNA, Daniela; DA SILVA, Altigran. A Study About Ruling Summarization Evaluation. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 295-305. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.232000.