Assessing the Impact of Stemming Algorithms Applied to Brazilian Legislative Documents Retrieval

Ellen Souza; Gyovana Moriyama; Douglas Vitório; André C. P. L. F. de Carvalho; Nádia Félix; Hidelberg O. Albuquerque; Adriano L. I. Oliveira

doi:10.5753/stil.2021.17802

Ellen Souza (UFRPE / USP)
Gyovana Moriyama (USP)
Douglas Vitório (UFRPE / UFPE)
André C. P. L. F. de Carvalho (USP)
Nádia Félix (USP / UFG)
Hidelberg O. Albuquerque (UFRPE / UFPE)
Adriano L. I. Oliveira (UFPE)

Abstract

The main purpose of stemming is to reduce inflected words to their root form, or stem. Words that share a stem can then be mapped to the same concept, which improves information retrieval by aiding document indexing and reducing data dimensionality. However, the efficiency of these algorithms varies according to different aspects, and studies in the area have reached contrasting conclusions. This work assesses the use of stemmers in the retrieval of legislative documents written in Portuguese. Four stemmers, each combined with BM25, were evaluated on two legislative corpora from the Brazilian Chamber of Deputies. The RSLP-S and Savoy stemmers yielded the largest improvements in the information retrieval pipeline.
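To make the idea concrete, the following is a minimal sketch of how a stemmer can change BM25 rankings. It is not the pipeline evaluated in the paper: `toy_stem` is a hypothetical plural stripper far cruder than RSLP-S or Savoy, the three documents and the query are invented for illustration, and the BM25 parameters (`k1=1.2`, `b=0.75`) and the non-negative idf variant are assumed common defaults rather than values taken from this work.

```python
import math
from collections import Counter


def toy_stem(token):
    """Hypothetical plural stripper for illustration; NOT RSLP-S or Savoy."""
    # Drop a final "s" from tokens longer than three characters.
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def tokenize(text, stem=None):
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens] if stem else tokens


def bm25_scores(query, docs, stem=None, k1=1.2, b=0.75):
    """Score every document against the query with a basic BM25 variant."""
    tokenized = [tokenize(d, stem) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query, stem):
            if term not in tf:
                continue
            # Assumed idf variant that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Invented toy corpus and query, for illustration only.
docs = [
    "projeto de lei sobre impostos",
    "projetos de leis sobre transporte",
    "requerimento de auditoria",
]
query = "lei impostos"

plain = bm25_scores(query, docs)
stemmed = bm25_scores(query, docs, stem=toy_stem)
print("without stemming:", plain)
print("with toy stemming:", stemmed)
```

Without stemming, the second document scores zero because it contains "leis" but not "lei". With the toy stemmer both forms map to "lei", so the document becomes retrievable while still ranking below the first document, which matches both query terms.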
