Iterative machine learning applied to annotation of text datasets
Abstract
This paper analyzes how different machine learning approaches and algorithms can be integrated as automated assistance in a tool that supports the creation of new annotated datasets. We evaluate how they scale in an environment without dedicated machine learning hardware. In particular, we study their behavior on a dataset with few examples and on one that is still under construction. We experiment with deep learning algorithms (BERT) and with classical learning algorithms that have a lower computational cost: Word2Vec (W2V) and GloVe embeddings combined with Random Forest (RF) and Support Vector Machine (SVM) classifiers. Our experiments show that the deep learning algorithms outperform the classical techniques. However, their high computational cost makes them unsuitable for an environment with limited hardware resources. We also run simulations that use active and iterative machine learning techniques to assist the creation of new datasets. For these simulations, we use the classical learning algorithms because of their lower computational cost. The knowledge gathered in this experimental evaluation is intended to support the creation of a tool for building new text datasets.
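The kind of active learning simulation mentioned above can be illustrated with a minimal sketch of pool-based uncertainty sampling. This is not the paper's implementation. It assumes synthetic 2-D points in place of W2V or GloVe document embeddings, a nearest-centroid classifier as a lightweight stand-in for RF or SVM, and a simulated annotator that simply reveals the hidden label. All function names and parameters (`simulate`, `rounds`, `batch`, `seed`) are hypothetical.

```python
import math
import random


def centroid(vectors):
    """Mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit(X, y):
    """Nearest-centroid model (stand-in for RF/SVM): one centroid per label."""
    by_label = {}
    for vec, label in zip(X, y):
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(model, vec):
    """Label of the closest centroid."""
    return min(model, key=lambda label: math.dist(vec, model[label]))


def margin(model, vec):
    """Gap between the two smallest centroid distances; small means uncertain."""
    dists = sorted(math.dist(vec, c) for c in model.values())
    return dists[1] - dists[0] if len(dists) > 1 else 0.0


def accuracy(model, X, y):
    return sum(predict(model, v) == t for v, t in zip(X, y)) / len(y)


def make_data(n_per_class, rng):
    """Synthetic 2-D points standing in for document embeddings (assumption)."""
    X, y = [], []
    for label, (cx, cy) in enumerate([(-2.0, 0.0), (2.0, 0.0)]):
        for _ in range(n_per_class):
            X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            y.append(label)
    return X, y


def simulate(rounds=10, batch=5, seed=0):
    """Return (labeled set size, test accuracy) measured at each round."""
    rng = random.Random(seed)
    pool_X, pool_y = make_data(200, rng)
    test_X, test_y = make_data(100, rng)
    # Seed set: one labeled example per class so the first model can be fit.
    labeled = [pool_y.index(0), pool_y.index(1)]
    unlabeled = [i for i in range(len(pool_X)) if i not in labeled]
    history = []
    for _ in range(rounds):
        model = fit([pool_X[i] for i in labeled], [pool_y[i] for i in labeled])
        history.append((len(labeled), accuracy(model, test_X, test_y)))
        # Query the examples the current model is least sure about.
        unlabeled.sort(key=lambda i: margin(model, pool_X[i]))
        queried, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled.extend(queried)  # the simulated annotator reveals pool_y
    return history
```

Calling `simulate()` and printing each `(size, accuracy)` pair shows how the model behaves as the labeled set grows. In a real annotation tool, the step that reveals `pool_y` would be replaced by a human annotator, and the classifier would be retrained on the updated labels each round.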