On the use of Query by Committee for Human-in-the-Loop Named Entity Recognition

  • Gabriel Corvino Universidade de Brasília
  • Vitor Vasconcelos Oliveira Universidade de Brasília
  • Angelo C. Mendes da Silva Universidade de São Paulo
  • Ricardo Marcondes Marcacini Universidade de São Paulo

Abstract


Named Entity Recognition (NER) is a relevant task for extracting information from textual data. Traditional methods for training NER models assume that humans annotate entities manually, identifying entities in predefined categories. This strategy demands substantial human effort, especially in more specialized application domains. To address these challenges, we consider Human-in-the-Loop (HITL) learning, a set of strategies for incorporating human knowledge and experience into machine learning while accelerating model training. In this paper, we investigate a classic method called Query by Committee (QBC), which helps select informative instances for data labeling. Traditionally, QBC selects instances with a high level of disagreement among the different models of a committee. We present heuristics for relaxing QBC so that it also considers some level of agreement. We show that exploiting committee agreement to pre-label instances is a promising way to speed up human feedback and enlarge the training set. Experimental results show that our method preserves model performance compared to traditional QBC while reducing human labeling effort.
Keywords: Human in the Loop, Active Learning, Ensemble, Named Entity Recognition
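To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of the QBC relaxation the abstract describes: instances on which the committee disagrees go to the human annotator, while instances on which it agrees are pre-labeled by majority vote. The threshold value and the vote-entropy disagreement measure are illustrative assumptions.

```python
from collections import Counter
import math

def vote_entropy(votes):
    """Disagreement score for one instance: entropy of the committee's
    label votes (0 = unanimous agreement, higher = more disagreement)."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_by_committee(committee_votes, agree_threshold=0.0):
    """Classic QBC routes high-disagreement instances to the human;
    the relaxed variant also pre-labels instances the committee agrees
    on, enlarging the training set with little annotation effort."""
    to_human, pre_labeled = [], []
    for idx, votes in enumerate(committee_votes):
        if vote_entropy(votes) <= agree_threshold:
            # the majority vote becomes a pre-label for quick confirmation
            label = Counter(votes).most_common(1)[0][0]
            pre_labeled.append((idx, label))
        else:
            to_human.append(idx)
    return to_human, pre_labeled

# Three committee members vote on four instances (NER entity tags).
votes = [["PER", "PER", "PER"],   # unanimous  -> pre-label
         ["ORG", "LOC", "ORG"],   # disagreement -> ask the human
         ["O",   "O",   "O"],     # unanimous  -> pre-label
         ["PER", "ORG", "LOC"]]   # maximal disagreement -> ask the human
human, auto = split_by_committee(votes)
print(human)  # [1, 3]
print(auto)   # [(0, 'PER'), (2, 'O')]
```

Raising `agree_threshold` above zero trades label quality for coverage: near-unanimous instances are also pre-labeled, which is the "some level of agreement" relaxation the paper investigates.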

References

Aggarwal, C. C., Kong, X., Gu, Q., Han, J., and Yu, P. S. Active learning: A survey. In Data Classification. CRC Press, USA, pp. 599–634, 2014.

Alonso, O. Algorithms and techniques for quality control. In The Practice of Crowdsourcing. Springer, Cham, pp. 53–63, 2019.

Beluch, W. H., Genewein, T., Nürnberger, A., and Köhler, J. M. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE/CVF, Salt Lake City, Utah, USA, pp. 9368–9377, 2018.

Dagan, I. and Engelson, S. P. Committee-based sampling for training probabilistic classifiers. In Machine Learning Proceedings 1995. Morgan Kaufmann, San Francisco (CA), pp. 150–157, 1995.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186, 2019.

Kumar, P. and Gupta, A. Active learning query strategies for classification, regression, and clustering: a survey. Journal of Computer Science and Technology 35 (4): 913–945, 2020.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In 8th International Conference on Learning Representations. OpenReview, Addis Ababa, Ethiopia, pp. 1–17, 2020.

Laws, F. and Schütze, H. Stopping criteria for active learning of named entity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics. ACL, USA, pp. 465–472, 2008.

Lewis, D. D. and Gale, W. A. A sequential algorithm for training text classifiers. In SIGIR’94. Springer, London, UK, pp. 3–12, 1994.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Melville, P. and Mooney, R. J. Diverse ensembles for active learning. In Proceedings of the Twenty-First International Conference on Machine Learning. Association for Computing Machinery, New York, NY, USA, p. 74, 2004.

Monarch, R. M. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Manning, UK, 2021.

Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., and Androutsopoulos, I. SemEval-2015 Task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Denver, Colorado, pp. 486–495, 2015.

Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., and Wang, X. A survey of deep active learning. ACM Computing Surveys (CSUR) 54 (9): 1–40, 2021.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Settles, B. Active learning literature survey. Technical Report 1648, University of Wisconsin–Madison, 2009.

Settles, B. and Craven, M. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, USA, pp. 1070–1079, 2008.

Seung, H. S., Opper, M., and Sompolinsky, H. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory. Association for Computing Machinery, New York, NY, USA, pp. 287–294, 1992.

Song, B., Li, F., Liu, Y., and Zeng, X. Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison. Briefings in Bioinformatics 22 (6): 282, 2021.

Souza, F., Nogueira, R., and Lotufo, R. BERTimbau: Pretrained BERT models for Brazilian Portuguese. In Brazilian Conference on Intelligent Systems. Springer International Publishing, Cham, pp. 403–417, 2020.

Sun, C., Qiu, X., Xu, Y., and Huang, X. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics. Springer International Publishing, Cham, pp. 194–206, 2019.

Tedeschi, S., Maiorca, V., Campolungo, N., Cecconi, F., and Navigli, R. WikiNEuRal: Combined neural and knowledge-based silver data creation for multilingual NER. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, pp. 2521–2533, 2021.

Wu, X., Xiao, L., Sun, Y., Zhang, J., Ma, T., and He, L. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems vol. 135, pp. 364–381, 2022.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems vol. 32, pp. 11, 2019.

Zanzotto, F. M. Human-in-the-loop artificial intelligence. Journal of Artificial Intelligence Research vol. 64, pp. 243–252, 2019.

Zhao, Y., Xu, C., and Cao, Y. Research on query-by-committee method of active learning and application. In Advanced Data Mining and Applications, X. Li, O. R. Zaïane, and Z. Li (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 985–991, 2006.
Published
28/11/2022
How to Cite

CORVINO, Gabriel; OLIVEIRA, Vitor Vasconcelos; MENDES DA SILVA, Angelo C.; MARCACINI, Ricardo Marcondes. On the use of Query by Committee for Human-in-the-Loop Named Entity Recognition. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 10., 2022, Campinas/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 106-113. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2022.227953.