Topic Modeling for the Legal Case Retrieval Task

Abstract


This article presents a topic-based approach to legal case retrieval. The method has two phases: filtering and ranking. In the first phase, a topic modeling technique is applied to the entire dataset to select an initial set of candidate cases for each query. In the second phase, a ranking function orders the candidates by relevance to the query. Experiments with three different ranking functions and with data collections in different languages indicate that the proposed approach is competitive. We attribute this to the strong correlation observed in our experiments between the topics of a query document and the topics of its relevant legal cases. In fact, our approach achieved higher precision than the values reported for the recently held Competition on Legal Information Extraction/Entailment (COLIEE) 2023, in which groups from around the world competed.
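
The two-phase structure described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper. The abstract names neither the topic modeling technique nor the three ranking functions, so the sketch assumes that topic labels have already been assigned to every document (the hypothetical `case_topics` mapping and `query_topic` value) and uses BM25 as a stand-in ranking function. All function names, identifiers, and toy documents are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would do more."""
    return text.lower().split()


def filter_by_topic(query_topic, case_topics):
    """Phase 1 (filtering): keep cases whose topic matches the query's topic.

    Assumption: topic labels were produced beforehand by some topic model.
    """
    return [cid for cid, topic in case_topics.items() if topic == query_topic]


def bm25_rank(query, candidates, corpus, k1=1.5, b=0.75):
    """Phase 2 (ranking): order candidates by BM25 score against the query.

    Assumption: BM25 stands in for the paper's unnamed ranking functions.
    Statistics are computed over the candidate set only.
    """
    docs = {cid: tokenize(corpus[cid]) for cid in candidates}
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n
    # Document frequency of each term within the candidate set.
    df = Counter(term for toks in docs.values() for term in set(toks))
    query_terms = set(tokenize(query))
    scores = {}
    for cid, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[cid] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection with hypothetical, precomputed topic labels.
corpus = {
    "case_a": "tenant eviction notice for unpaid rent",
    "case_b": "landlord dispute over rent increase and eviction",
    "case_c": "patent infringement claim on software",
}
case_topics = {"case_a": 0, "case_b": 0, "case_c": 1}
query = "eviction for unpaid rent"
query_topic = 0

candidates = filter_by_topic(query_topic, case_topics)
ranking = bm25_rank(query, candidates, corpus)
print(candidates)
print(ranking)
```

In this sketch the filtering phase removes `case_c` before any scoring happens, so the ranking function only has to score the documents that share the query's topic. This reflects the motivation stated in the abstract: relevant cases tend to share topics with the query document.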

Keywords: Topic Modeling, IR, Legal Cases, Legal Case Retrieval

Published
2023-09-25
PEREIRA NOVAES, Luisa; VIANNA, Daniela; DA SILVA, Altigran Soares. Topic Modeling for the Legal Case Retrieval Task. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 128-140. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.232576.