Assessment of text clustering approaches for legal documents
Abstract
The judicial system produces a large number of court documents. These documents may contain relevant information that supports decision-making in future proceedings. However, collecting this information is not a trivial task. This article proposes the use of clustering techniques to group similar court lawsuits and thereby facilitate the collection of information. To this end, several approaches were evaluated to identify the one most appropriate for this task. The approaches were applied to a database composed of 1515 facts of initial petitions, and were evaluated using internal metrics and by examining the texts of the grouped court lawsuits. The results showed that the best approach to grouping court lawsuits combines the K-Means algorithm with the TF-IDF representation technique and the PCA technique.
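As an illustration of the best-performing combination named above (TF-IDF, then PCA, then K-Means, scored with an internal metric), a minimal sketch follows. It assumes scikit-learn. The function name `cluster_texts`, the toy texts, the number of components and clusters, and the choice of the silhouette coefficient as the internal metric are all assumptions made for illustration, not details taken from the study.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_texts(texts, n_clusters, n_components=2, seed=0):
    """Cluster texts with TF-IDF -> PCA -> K-Means.

    Returns the cluster label of each text and the silhouette
    coefficient (an internal metric) of the resulting partition.
    All parameter defaults are illustrative assumptions.
    """
    # Represent each text as a TF-IDF weighted bag of words.
    tfidf = TfidfVectorizer().fit_transform(texts)
    # scikit-learn's PCA expects a dense array; acceptable for small corpora.
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(
        tfidf.toarray()
    )
    # Group the reduced vectors with K-Means.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(reduced)
    return labels, silhouette_score(reduced, labels)


# Hypothetical stand-ins for petition facts (not data from the study).
toy_texts = [
    "flight delayed airline refund ticket",
    "airline cancelled flight refund ticket",
    "flight delayed airline compensation ticket",
    "bank loan contract interest charged",
    "bank charged interest loan contract",
    "loan contract bank interest fees",
]

labels, score = cluster_texts(toy_texts, n_clusters=2)
print(labels, round(float(score), 3))
```

In practice the number of clusters would be chosen by comparing the internal metric across several candidate values, and the texts of each resulting group would then be read to check that the grouping is meaningful.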
