Comparing Contextual Embeddings for Semantic Textual Similarity in Portuguese

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13074)

Abstract

Semantic textual similarity (STS) measures how semantically similar two sentences are. For the Portuguese language, the STS literature is still incipient but includes important initiatives such as the ASSIN and ASSIN 2 shared tasks. The state of the art for those datasets is a contextual embedding produced by a BERT model pre-trained on Portuguese and then fine-tuned. In this work, we investigate the application of Sentence-BERT (SBERT) contextual embeddings to these datasets. Compared to BERT, SBERT is a more computationally efficient approach, which enables its application to scalable unsupervised learning problems. Given the absence of SBERT models pre-trained in Portuguese and the computational cost of such training, we adopt multilingual models and also fine-tune them for Portuguese. Results showed that SBERT embeddings were competitive, especially after fine-tuning, numerically surpassing the results of BERT on ASSIN 2 and the results observed during the shared tasks for all datasets considered.
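As a rough illustration of how sentence embeddings are typically used for STS, the sketch below scores each sentence pair by the cosine similarity of its two embedding vectors and then correlates those scores with gold similarity ratings. The abstract does not state the similarity function or the evaluation metric, so cosine similarity and Pearson correlation are assumptions here. The three-dimensional vectors and gold scores are made-up toy values standing in for real SBERT outputs and dataset annotations.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation between predicted and gold scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy stand-ins for sentence embeddings (hypothetical values, not model output).
pairs = [
    ([1.0, 0.0, 1.0], [1.0, 0.1, 0.9]),  # near-paraphrase
    ([1.0, 0.0, 0.0], [0.5, 0.5, 0.0]),  # partially related
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # unrelated
]
# Hypothetical gold similarity ratings for the three pairs.
gold = [4.8, 3.0, 1.0]

predicted = [cosine_similarity(u, v) for u, v in pairs]
correlation = pearson(predicted, gold)
print(predicted, correlation)
```

Because each sentence is embedded once and pairs are then compared with cheap vector operations, this style of scoring scales to large collections, which is the efficiency argument the abstract makes for SBERT.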

Notes

  1. For brevity, other relevant works such as [19], which compares Word2Vec, FastText, ELMo, and BERT on ASSIN, are not included, as their results are surpassed by [8].

  2. Though models based on other relevant architectures, such as the multilingual universal sentence encoder [23], were available at the SBERT repository, we did not include them in our work due to the lack of training setup details.

References

  1. Agirre, E., Diab, M., Cer, D., Gonzalez-Agirre, A.: SemEval-2012 task 6: a pilot on semantic textual similarity. In: SemEval, pp. 385–393. ACL, USA (2012)

  2. Andrade, J., Bezerra, L.C.T., Cardoso-Silva, J.: Comparing contextual embeddings for semantic textual similarity in Portuguese (supplementary material) (2021). https://github.com/andradejunior/bracis-2021-supp-material

  3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)

  4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. ACL 5, 135–146 (2017)

  5. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In: SemEval, pp. 1–14. ACL, Vancouver (2017)

  6. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: ACL, pp. 8440–8451. ACL (2020)

  7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186. ACL, Minneapolis (2019)

  8. Fialho, P., Coheur, L., Quaresma, P.: Benchmarking natural language inference and semantic textual similarity for Portuguese. Information 11, 484 (2020)

  9. Fonseca, E.R., Borges dos Santos, L., Criscuolo, M., Aluísio, S.M.: Visão geral da avaliação de similaridade semântica e inferência textual. Linguamática 8(2), 3–13 (2016)

  10. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey. Int. J. Comput. Vis. 129(6), 1789–1819 (2021)

  11. Jurafsky, D., Martin, J.H.: Speech and Language Processing, 2nd edn. Prentice-Hall Inc., Upper Saddle River (2009)

  12. Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., Zamparelli, R.: SemEval-2014 task 1: evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In: SemEval, pp. 1–8. ACL, Dublin (2014)

  13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NeurIPS, pp. 3111–3119. Curran Associates Inc., Red Hook (2013)

  14. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543. ACL, Doha, Qatar (2014)

  15. Real, L., Fonseca, E., Gonçalo Oliveira, H.: The ASSIN 2 shared task: a quick overview. In: Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., Gonçalves, T. (eds.) PROPOR 2020. LNCS (LNAI), vol. 12037, pp. 406–412. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41505-1_39

  16. Real, L., et al.: SICK-BR: a Portuguese corpus for inference. In: Villavicencio, A. (ed.) PROPOR 2018. LNCS (LNAI), vol. 11122, pp. 303–312. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_31

  17. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using siamese BERT-networks. In: EMNLP, pp. 3973–3983. ACL (2019)

  18. Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: EMNLP, pp. 4512–4525. ACL (2020)

  19. Rodrigues, R.C., Rodrigues, J., de Castro, P.V.Q., da Silva, N.F.F., Soares, A.: Portuguese language models and word embeddings: evaluating on semantic similarity tasks. In: Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., Gonçalves, T. (eds.) PROPOR 2020. LNCS (LNAI), vol. 12037, pp. 239–248. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41505-1_23

  20. Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28

  21. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)

  22. Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: LREC. ELRA, Miyazaki, Japan (2018)

  23. Yang, Y., et al.: Multilingual universal sentence encoder for semantic retrieval. In: ACL: System Demonstrations, pp. 87–94. ACL, Online (2020)

  24. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) NeurIPS, vol. 32. Curran Associates, Inc. (2019)

  25. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) NIPS, vol. 27. Curran Associates, Inc. (2014)

Corresponding author

Correspondence to José E. Andrade Junior.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 22 KB)

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Andrade Junior, J.E., Cardoso-Silva, J., Bezerra, L.C.T. (2021). Comparing Contextual Embeddings for Semantic Textual Similarity in Portuguese. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_27

  • DOI: https://doi.org/10.1007/978-3-030-91699-2_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91698-5

  • Online ISBN: 978-3-030-91699-2

  • eBook Packages: Computer Science, Computer Science (R0)
