Semantic Textual Similarity: In Defense of Wordnet-Based Methods

Abstract


Wordnets have long been used as a tool for evaluating the semantic similarity between short texts. Besides being simpler than recent deep learning approaches, wordnet-based methods offer an important advantage: their results are easy to interpret, because their decisions are usually based on the proximity between graph nodes. In this work, we explore a lightweight approach based on a Portuguese wordnet to solve the ASSIN 2 Semantic Textual Similarity (STS) shared task. In this task, each instance of the dataset is a pair of Portuguese sentences annotated with a semantic similarity score, and the goal is to learn an STS model that estimates the similarity of new, previously unseen sentence pairs. Experiments show that our results are competitive with state-of-the-art methods in terms of mean squared error.
Keywords: Semantic Textual Similarity, WordNet, Portuguese, Supervised Machine Learning
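To make the idea of "proximity between graph nodes" concrete, here is a minimal sketch of a path-based similarity score and of the mean squared error metric. It is not the paper's method. The toy graph, the `1 / (1 + distance)` formula, the best-match averaging, and all function names are assumptions chosen for illustration. A supervised STS model would typically feed scores like these as features to a regressor, which is not shown here.

```python
from collections import deque

# Hypothetical toy graph of lexical relations (NOT a real Portuguese wordnet).
TOY_GRAPH = {
    "cachorro": {"cão", "animal"},
    "cão": {"cachorro", "animal"},
    "gato": {"animal"},
    "animal": {"cachorro", "cão", "gato"},
    "correr": {"mover"},
    "andar": {"mover"},
    "mover": {"correr", "andar"},
}


def path_length(graph, source, target):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # breadth-first search
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def word_similarity(graph, w1, w2):
    """Assumed scoring: closer nodes score higher; unreachable pairs score 0."""
    dist = path_length(graph, w1, w2)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def sentence_similarity(graph, s1, s2):
    """Average, in both directions, of each word's best match in the other sentence."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return 0.0

    def directed(a, b):
        return sum(max(word_similarity(graph, x, y) for y in b) for x in a) / len(a)

    return (directed(t1, t2) + directed(t2, t1)) / 2.0


def mean_squared_error(gold, predicted):
    """Mean of squared differences between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, predicted)) / len(gold)
```

Because every score traces back to a path in the graph, a low or high similarity value can be explained by inspecting which nodes were matched and how far apart they are.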

Published
September 25, 2023
How to Cite

GONÇALVES, Eduardo Corrêa. Semantic Textual Similarity: In Defense of Wordnet-Based Methods. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 22-32. DOI: https://doi.org/10.5753/stil.2023.233464.