A Textual Representation Based on Bag-of-Concepts and Thesaurus for Legal Information Retrieval

Wagner M. Costa; Glauco V. Pedrosa

doi:10.5753/kdmile.2022.227779

Wagner M. Costa, Universidade de Brasília
Glauco V. Pedrosa, Universidade de Brasília


Abstract

Retrieving similar textual documents is a challenging task in the legal domain because legal language has its own peculiar characteristics. This paper presents BoC-Th, a new approach for representing legal documents that builds on the Bag-of-Concepts (BoC) approach. BoC generates concepts by clustering word vectors produced by a basic neural network model and then computes the frequencies of these concept clusters to form document vectors. The novel contribution of BoC-Th is to generate weighted histograms of concepts, where the weights are defined from the distance of each word to its respective similar term within a thesaurus. The idea is to emphasize the words that are most significant for the context, thus generating more discriminative vectors. Experimental evaluations compared the proposed approach with the traditional BoW and BoC approaches, both popular techniques for document representation. The proposed method obtained the best result among the evaluated techniques for retrieving judgments and jurisprudence documents. BoC-Th increased the mAP (mean Average Precision) by 51% compared to the traditional BoC approach, while being up to 3.4 times faster than the traditional BoW representation.
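The difference between a plain concept histogram and a thesaurus-weighted one can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the weighting formula, so the weight `1 / (1 + d)` (with `d` the Euclidean distance from a word vector to its closest thesaurus term vector) is an assumption made only for illustration. The toy two-dimensional word vectors, the two concept centroids, and the function names `nearest_concept`, `thesaurus_weight`, and `concept_histogram` are likewise hypothetical; in practice the vectors would come from a trained embedding model and the centroids from a clustering step.

```python
import math

def nearest_concept(vec, centroids):
    """Index of the concept centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(vec, centroids[k]))

def thesaurus_weight(vec, thesaurus_vectors):
    """ASSUMED weighting: words close to a thesaurus term get weight near 1."""
    d = min(math.dist(vec, t) for t in thesaurus_vectors)
    return 1.0 / (1.0 + d)

def concept_histogram(tokens, word_vectors, centroids, thesaurus_vectors=None):
    """Normalized concept histogram; weighted when thesaurus vectors are given."""
    hist = [0.0] * len(centroids)
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary words
        k = nearest_concept(vec, centroids)
        if thesaurus_vectors is None:
            hist[k] += 1.0  # plain BoC: every occurrence counts equally
        else:
            hist[k] += thesaurus_weight(vec, thesaurus_vectors)
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy data (hypothetical): two legal terms and two function words.
word_vectors = {
    "appeal": (0.0, 0.0),
    "court": (0.0, 1.0),
    "the": (5.0, 5.0),
    "of": (5.0, 6.0),
}
centroids = [(0.0, 0.5), (5.0, 5.5)]   # concept 0: legal terms, concept 1: function words
thesaurus_vectors = [(0.0, 0.0)]       # vector of the thesaurus term "appeal"
tokens = ["the", "appeal", "of", "the", "court"]

plain = concept_histogram(tokens, word_vectors, centroids)
weighted = concept_histogram(tokens, word_vectors, centroids, thesaurus_vectors)
print("plain BoC:   ", plain)
print("weighted BoC:", weighted)
```

In this toy document the function words outnumber the legal terms, so the plain histogram is dominated by the second concept, while the weighted histogram shifts most of the mass to the concept containing the words closest to the thesaurus term.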

Keywords: textual representation, bag of concepts, text mining, word embeddings
