Investigating the effects of synthetic text generation for question answering: Empirical studies on E-commerce context

  • Víctor Jesús Sotelo Chico Unicamp
  • Victor Hochgreb De Freitas GoBots
  • Julio Cesar Dos Reis Unicamp


Identifying semantic relatedness among sentences helps improve semantic retrieval tasks such as frequently asked questions (FAQ) retrieval. In this context, historical questions can aid in answering new ones. E-commerce platforms handle many customer questions, which makes them well suited to FAQ retrieval. However, ranking semantic similarity in e-commerce must account for specific details such as products, branches, voltage, etc. Small changes in these characteristics can drastically change the intent of a question. Although existing datasets help train models to learn semantic similarity, they are composed of general questions, which may lead to poor ranking in specific contexts. Most of these datasets are available only in English, limiting studies in other languages. In this research, we apply a Portuguese sentence generation model to train a similarity model that learns further aspects of the e-commerce context. Our approach helps to respond more accurately and improves ranking quality. We conduct several experimental evaluations to understand the effects of synthetic text generation techniques in this domain.
Keywords: data augmentation, NLP, semantic textual similarity, e-commerce


How to Cite

CHICO, Víctor Jesús Sotelo; FREITAS, Victor Hochgreb De; REIS, Julio Cesar Dos. Investigating the effects of synthetic text generation for question answering: Empirical studies on E-commerce context. In: SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 28., 2022, Curitiba. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 131-140.