Doclass: open-source software to support document labeling and classification
Abstract
This article introduces Doclass, a free and open-source web application that assists in labeling and classifying large sets of documents. The research followed a design science research methodology, guided by the real demands of a legal text processing company. The article presents the architecture, several design decisions, and the current development stage of the software, and describes preliminary user experiments that evaluate interactive document labeling. As a result, the first version of the system was published, with an architecture composed of a mobile frontend that communicates with a backend through a REST API; the requesting company evaluated its performance as satisfactory. Other results involve the use of active learning techniques to reduce the human effort of classifying documents, with the Uncertainty strategy used to choose the next document to be labeled. A confidence-based stopping criterion for the active learning technique was tested and proved unsatisfactory; improving it remains future work.
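To make the active learning loop mentioned above more concrete, the following is a minimal sketch of uncertainty sampling combined with a confidence-based stopping check. It is not Doclass's implementation: the function names `least_confident` and `next_document`, the least-confidence measure, and the 0.9 threshold are assumptions chosen for illustration, and the class probabilities are assumed to come from some classifier trained on the documents labeled so far.

```python
from typing import Optional, Sequence


def least_confident(probabilities: Sequence[Sequence[float]]) -> int:
    """Return the index of the document whose most likely class has the lowest probability."""
    confidences = [max(p) for p in probabilities]
    return min(range(len(confidences)), key=confidences.__getitem__)


def next_document(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical value, not taken from the article
) -> Optional[int]:
    """Pick the next document to show the annotator, or None to stop.

    Assumed stopping rule: stop once even the least confident prediction
    in the unlabeled pool reaches the threshold.
    """
    if not probabilities:
        return None  # nothing left to label
    idx = least_confident(probabilities)
    if max(probabilities[idx]) >= threshold:
        return None  # the model is confident about every remaining document
    return idx


# Hypothetical class probabilities for three unlabeled documents.
pool = [[0.95, 0.05], [0.55, 0.45], [0.20, 0.80]]
print(next_document(pool))  # 1
print(next_document([[0.95, 0.05], [0.08, 0.92]]))  # None
```

In the first call the second document is selected because its top class probability (0.55) is the lowest in the pool. In the second call every prediction is at or above the threshold, so the sketch signals that labeling can stop. The article reports that a confidence-based stopping criterion of this kind performed unsatisfactorily in its tests, so the threshold check should be read only as an illustration of the idea.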
References
Fielding, R. T. Architectural styles and the design of network-based software architectures. Ph.D. thesis, University of California, Irvine, 2000.
Lacerda, D. P., Dresch, A., Proença, A., and Antunes Júnior, J. A. V. Design Science Research: método de pesquisa para a engenharia de produção. Gestão & Produção 20 (4): 741–761, Nov., 2013.
Lewis, D. D. and Gale, W. A. A sequential algorithm for training text classifiers. In SIGIR ’94, B. W. Croft and C. J. van Rijsbergen (Eds.). Springer London, London, pp. 3–12, 1994.
Luhn, H. P. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development 1 (4): 309–317, 1957.
Mayor, S. and Pant, B. Document classification using support vector machine. International Journal of Engineering Science and Technology vol. 4, pp. 1741–1745, 04, 2012.
Settles, B. Active learning literature survey. Tech. rep., University of Wisconsin-Madison Department of Computer Sciences, 2010.
Settles, B. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing vol. 1, pp. 12, 2011.
Zhu, J., Wang, H., Hovy, E., and Ma, M. Confidence-based stopping criteria for active learning for data annotation. ACM Transactions on Speech and Language Processing 6 (3): 3:1–3:24, Apr., 2010.