Using a labeling function for automatic classification of agribusiness news: A weak supervision approach

Rodrigo Neves Trindade; Luiz H. D. Martins; Geraldo Nunes Correa; Ivan José dos Reis Filho

doi:10.5753/eniac.2022.227219

Rodrigo Neves Trindade UEMG
Luiz H. D. Martins UEMG
Geraldo Nunes Correa UEMG
Ivan José dos Reis Filho UEMG

Abstract

The large volume of news generated on the internet has increased the use of machine learning applications. Predictive models need labeled samples in large quantity and of high quality to ensure good accuracy in classification tasks. However, labeling news is a manual task that demands time from domain experts. In this work, a function is proposed to label agribusiness news. Fluctuations in the soybean price series in the domestic and international markets, together with the US dollar exchange rate, are the input to the labeling function. Different learning paradigms and text representations are used in the evaluation stage. Neural language models showed the best performance, and the results indicate that the proposal may be an alternative for real-time applications.
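To make the idea of a price-driven labeling function concrete, the following is a minimal sketch and not the authors' implementation. The abstract only states that fluctuations in three series (domestic soybean price, international soybean price, dollar exchange rate) are the input. The majority-vote rule, the `threshold` parameter, the label names, the series keys, and all numeric values below are assumptions made for illustration.

```python
from typing import Dict, List


def pct_change(series: List[float], day: int) -> float:
    """Relative change of a series from the previous day to `day`."""
    return (series[day] - series[day - 1]) / series[day - 1]


def label_news(day: int, series: Dict[str, List[float]], threshold: float = 0.0) -> str:
    """Hypothetical rule: majority vote over the direction of each series.

    Each series votes +1 if it rose by more than `threshold`, -1 if it fell
    by more than `threshold`, and 0 otherwise. The sign of the vote total
    becomes the label for news published on `day`.
    """
    votes = 0
    for values in series.values():
        change = pct_change(values, day)
        if change > threshold:
            votes += 1
        elif change < -threshold:
            votes -= 1
    if votes > 0:
        return "positive"
    if votes < 0:
        return "negative"
    return "neutral"


# Made-up values for illustration only; index 0 is the reference day.
series = {
    "soy_domestic": [100.0, 102.0, 101.0],
    "soy_international": [50.0, 51.0, 50.5],
    "usd_brl": [5.00, 4.95, 5.05],
}

# Label the news published on day 1 and day 2.
labels = [label_news(day, series) for day in (1, 2)]
print(labels)
```

Labels produced this way are noisy by construction, which is why the approach falls under weak supervision: the classifier is trained on them without a domain expert reviewing each news item.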
