Doclass: open-source software to support document labeling and classification

Marcelo Inuzuka; Hugo do Nascimento; Fernando Almeida; Bruno Barros; Walid Jradi

doi:10.5753/kdmile.2020.11965

Marcelo Inuzuka, Universidade Federal de Goiás
Hugo do Nascimento, Universidade Federal de Goiás
Fernando Almeida, Universidade Federal de Goiás
Bruno Barros, Universidade Federal de Goiás
Walid Jradi, Ultimatum Tecnologia Jurídica

Abstract

This article introduces Doclass, free and open-source web software that assists in labeling and classifying large sets of documents. The research followed a design science research methodology, guided by the real demands of a legal text processing company. The architecture, several design decisions, and the current development stage of the software are presented, and preliminary user experiments evaluating interactive document labeling are described. As a result, the first version of the system was published; its architecture consists of a mobile frontend that communicates with a backend through a REST API, and the requesting company rated its performance as satisfactory. Other results concern the use of active learning to reduce the human effort of classifying documents, with the uncertainty strategy used to choose the next document to be labeled. A stopping criterion for active learning based on confidence level was also tested; it proved unsatisfactory, so this remains future work.

Keywords: document classification, active learning, annotation tool, document labeling, legal text
