How aspects of similar datasets can impact distributional models

  • Isabella Maria Alonso Gomes USP
  • Norton Trevisan Roman USP


Distributional models have become popular because their abstractions allow immediate use, with good results and little implementation effort compared to precursor models. Given their presumed high level of generalization, one would expect good and similar results on data sets that share the same nature and purpose. However, this is not always the case. In this work, we present the results of applying BERTimbau to two related data sets built for the task of semantic similarity identification, with the goal of detecting redundancy in text. The results show considerable differences in accuracy between the data sets. We explore aspects of the data sets that could explain these differences.


How to Cite

GOMES, Isabella Maria Alonso; ROMAN, Norton Trevisan. How aspects of similar datasets can impact distributional models. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 19., 2022, Campinas/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 579-590. ISSN 2763-9061.