Comparing Contextual Embeddings for Semantic Textual Similarity in Portuguese

Andrade Junior, José E.; Cardoso-Silva, Jonathan; Bezerra, Leonardo C. T.

doi:10.1007/978-3-030-91699-2_27

Conference paper
First Online: 28 November 2021

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13074)

Abstract

Semantic textual similarity (STS) measures how semantically similar two sentences are. For Portuguese, the STS literature is still incipient but includes important initiatives such as the ASSIN and ASSIN 2 shared tasks. The state of the art for those datasets is a contextual embedding produced by a BERT model pre-trained on Portuguese and then fine-tuned. In this work, we investigate the application of Sentence-BERT (SBERT) contextual embeddings to these datasets. Compared to BERT, SBERT is more computationally efficient, which enables its application to scalable unsupervised learning problems. Given the absence of SBERT models pre-trained in Portuguese and the computational cost of such training, we adopt multilingual models and also fine-tune them for Portuguese. Results show that SBERT embeddings are competitive, especially after fine-tuning, numerically surpassing the results of BERT on ASSIN 2 and the results observed during the shared tasks for all datasets considered.
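
As a minimal sketch of how an embedding-based STS score could be computed and evaluated: the abstract does not specify the similarity function or the evaluation metric, so cosine similarity and Pearson correlation are assumptions here, and the three-dimensional vectors and gold scores below are made-up stand-ins for real SBERT sentence embeddings and human annotations.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: each pair stands in for the embeddings of two
# sentences; a real pipeline would obtain these from a sentence encoder.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # same direction
    ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]),  # partially aligned
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal
]
# Hypothetical human similarity judgements for the same pairs.
gold = [5.0, 3.0, 1.0]

predicted = [cosine_similarity(u, v) for u, v in pairs]
score = pearson(predicted, gold)
print(f"Pearson correlation: {score:.3f}")
```

Because each sentence is encoded once and pairs are compared with a cheap vector operation, this kind of setup scales to many sentence pairs, which is the efficiency argument the abstract makes for SBERT-style embeddings.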

Notes

1. For brevity, other relevant works such as [19], which compares Word2Vec, FastText, ELMo, and BERT on ASSIN, are not included, as their results are surpassed by [8].
2. Though models based on other relevant architectures, such as the multilingual universal sentence encoder [23], were available at the SBERT repository, we did not include them in our work due to the lack of training setup details.

References

1. Agirre, E., Diab, M., Cer, D., Gonzalez-Agirre, A.: SemEval-2012 task 6: a pilot on semantic textual similarity. In: SemEval, pp. 385–393. ACL, USA (2012)
2. Andrade, J., Bezerra, L.C.T., Cardoso-Silva, J.: Comparing contextual embeddings for semantic textual similarity in Portuguese (supplementary material) (2021). https://github.com/andradejunior/bracis-2021-supp-material
3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. ACL 5, 135–146 (2017)
5. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: SemEval, pp. 1–14. ACL, Vancouver (2017)
6. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: ACL, pp. 8440–8451. ACL (2020)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186. ACL, Minneapolis (2019)
8. Fialho, P., Coheur, L., Quaresma, P.: Benchmarking natural language inference and semantic textual similarity for Portuguese. Information 11, 484 (2020)
9. Fonseca, E.R., Borges dos Santos, L., Criscuolo, M., Aluísio, S.M.: Visão geral da avaliação de similaridade semântica e inferência textual. Linguamática 8(2), 3–13 (2016)
10. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey. Int. J. Comput. Vis. 129(6), 1789–1819 (2021)
11. Jurafsky, D., Martin, J.H.: Speech and Language Processing, 2nd edn. Prentice-Hall Inc., Upper Saddle River (2009)
12. Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., Zamparelli, R.: SemEval-2014 task 1: evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In: SemEval, pp. 1–8. ACL, Dublin (2014)
13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NeurIPS, pp. 3111–3119. Curran Associates Inc., Red Hook (2013)
14. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543. ACL, Doha, Qatar (2014)
15. Real, L., Fonseca, E., Gonçalo Oliveira, H.: The ASSIN 2 shared task: a quick overview. In: Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., Gonçalves, T. (eds.) PROPOR 2020. LNCS (LNAI), vol. 12037, pp. 406–412. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41505-1_39
16. Real, L., et al.: SICK-BR: a Portuguese corpus for inference. In: Villavicencio, A. (ed.) PROPOR 2018. LNCS (LNAI), vol. 11122, pp. 303–312. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_31
17. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using siamese BERT-networks. In: EMNLP, pp. 3973–3983. ACL (2019)
18. Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: EMNLP, pp. 4512–4525. ACL (2020)
19. Rodrigues, R.C., Rodrigues, J., de Castro, P.V.Q., da Silva, N.F.F., Soares, A.: Portuguese language models and word embeddings: evaluating on semantic similarity tasks. In: Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., Gonçalves, T. (eds.) PROPOR 2020. LNCS (LNAI), vol. 12037, pp. 239–248. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41505-1_23
20. Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28
21. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)
22. Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: LREC. ELRA, Miyazaki, Japan (2018)
23. Yang, Y., et al.: Multilingual universal sentence encoder for semantic retrieval. In: ACL: System Demonstrations, pp. 87–94. ACL, Online (2020)
24. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) NeurIPS, vol. 32. Curran Associates, Inc. (2019)
25. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) NIPS, vol. 27. Curran Associates, Inc. (2014)

Author information

Authors and Affiliations

IMD, Universidade Federal do Rio Grande do Norte, Natal, RN, Brazil
José E. Andrade Junior & Leonardo C. T. Bezerra
iFood, Osasco, SP, Brazil
José E. Andrade Junior
Data Science Brigade, Porto Alegre, RS, Brazil
Jonathan Cardoso-Silva
London School of Economics and Political Science, London, UK
Jonathan Cardoso-Silva

Corresponding author

Correspondence to José E. Andrade Junior.

Editor information

Editors and Affiliations

Universidade Federal de Sergipe, São Cristóvão, Brazil
André Britto
Universidade de São Paulo, São Paulo, Brazil
Karina Valdivia Delgado

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 22 KB)

Cite this paper

Andrade Junior, J.E., Cardoso-Silva, J., Bezerra, L.C.T. (2021). Comparing Contextual Embeddings for Semantic Textual Similarity in Portuguese. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_27

DOI: https://doi.org/10.1007/978-3-030-91699-2_27
Published: 28 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2
eBook Packages: Computer Science, Computer Science (R0)
