Measuring Brazilian Portuguese Product Titles Similarity using Embeddings

Alan da Silva Romualdo; Livy Real; Helena de Medeiros Caseli

doi:10.5753/stil.2021.17791

Alan da Silva Romualdo (UFSCar)
Livy Real (Americanas S. A.)
Helena de Medeiros Caseli (UFSCar)


Abstract

Textual similarity is the task of determining how similar two pieces of text are, in terms of either lexical (surface form) or semantic (meaning) closeness. In this paper we apply word embeddings to measure the similarity of e-commerce product titles in Brazilian Portuguese. We trained domain-specific word embeddings (using Word2Vec, FastText and GloVe) and compared them with general-domain models (word embeddings and BERT models). We conclude that cosine similarity computed with the domain-specific word embeddings is a good approach for distinguishing similar from non-similar products, but that the pre-trained multilingual BERT model performed best.
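As a rough illustration of the cosine-similarity idea described in the abstract, the sketch below scores two product titles by averaging the vectors of their tokens and comparing the averages. It is not the authors' implementation: the abstract does not say how word vectors are combined into a title vector, so the averaging step is an assumption, and the names `TOY_EMBEDDINGS`, `title_vector` and `title_similarity`, together with the three-dimensional vectors, are hypothetical stand-ins for a trained Word2Vec, FastText or GloVe model.

```python
import math

# Toy 3-dimensional vectors invented for illustration only (assumption);
# real vectors would come from a trained Word2Vec, FastText or GloVe model.
TOY_EMBEDDINGS = {
    "celular": [0.9, 0.1, 0.0],
    "smartphone": [0.8, 0.2, 0.1],
    "samsung": [0.7, 0.3, 0.2],
    "capa": [0.1, 0.9, 0.1],
    "geladeira": [0.0, 0.1, 0.9],
    "frost": [0.1, 0.0, 0.8],
}


def title_vector(title, embeddings):
    """Average the vectors of the title's known tokens.

    Averaging is an assumption of this sketch. Returns None when no token
    of the title is in the vocabulary.
    """
    vectors = [embeddings[t] for t in title.lower().split() if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def title_similarity(title_a, title_b, embeddings=TOY_EMBEDDINGS):
    """Cosine similarity of two averaged title vectors (0.0 if a title has no known token)."""
    vec_a = title_vector(title_a, embeddings)
    vec_b = title_vector(title_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine_similarity(vec_a, vec_b)


if __name__ == "__main__":
    print(title_similarity("Celular Samsung", "Smartphone Samsung"))
    print(title_similarity("Celular Samsung", "Geladeira Frost"))
```

With these toy vectors the first pair of titles scores higher than the second. In practice a decision threshold for "similar" versus "non-similar" would have to be chosen on labelled data.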
