Generation of test datasets using LLM - Quality Assurance Perspective

  • Jose Leandro Sousa (INDT)
  • Cristian Souza (INDT)
  • Raiza Hanada (INDT)
  • Diogo Nascimento (INDT)
  • Eliane Collins (INDT)

Abstract


Domain-relevant data and an adequate number of samples are necessary to properly evaluate the robustness of Machine Learning (ML) models. This is the case for ML models used in the software localization translation task. Neural Machine Translation (NMT) models are commonly used in software localization to automate the translation of textual content while accounting for specific linguistic and cultural aspects. However, unlike general machine translation, which can readily draw on existing translation corpora for model training and testing, domain-specific machine translation faces a major obstacle: the scarcity of domain-specific translation data. In the absence of adequate data, this paper first presents a method to generate test samples using a text-generation Large Language Model (LLM) approach. Based on the generated samples, we run tests to assess the robustness of an NMT model. The evaluation indicates that human judgment is important to check whether the generated text is robust and coherent under different conditions. It also demonstrates that the generated samples were crucial to expose limitations in the model's effectiveness for software localization translation, in particular issues involving date and time formats, numeric representations, and measurement units.
Keywords: dataset creation, LLMs evaluation, text generation, neural machine translation, software translation
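To make the approach more concrete, the sketch below illustrates one way locale-sensitive test samples could be generated with a text-generation LLM and then used to probe an NMT model. This is a minimal sketch under stated assumptions: the llm_generate and nmt_translate helpers, the prompt wording, the categories, and the pt-BR target are illustrative placeholders, not the implementation described in the paper.

# Hypothetical sketch: LLM-driven generation of locale-sensitive test samples
# for probing an NMT model used in software localization. llm_generate() and
# nmt_translate() stand in for whatever LLM and NMT backends are actually used.

from typing import Callable, List

PROMPT_TEMPLATE = (
    "Generate {n} short English UI strings for a mobile application. "
    "Each string must contain {category} (for example, {example}). "
    "Return one string per line."
)

CATEGORIES = {
    "a date": "a release date such as 12/31/2024",
    "a time": "a time such as 5:30 PM",
    "a numeric value": "a number such as 1,234.56",
    "a measurement unit": "a value with a unit such as 3.5 MB or 72 °F",
}


def build_test_set(llm_generate: Callable[[str], str], per_category: int = 10) -> List[str]:
    """Ask the LLM for UI strings covering each locale-sensitive category."""
    samples: List[str] = []
    for category, example in CATEGORIES.items():
        prompt = PROMPT_TEMPLATE.format(n=per_category, category=category, example=example)
        raw = llm_generate(prompt)
        samples.extend(line.strip() for line in raw.splitlines() if line.strip())
    return samples


def probe_nmt(samples: List[str], nmt_translate: Callable[[str, str], str], target_lang: str = "pt-BR") -> None:
    """Translate each generated sample and print source/target pairs for human review."""
    for source in samples:
        target = nmt_translate(source, target_lang)
        # Human judgment is still needed to check whether dates, numbers, and
        # units were localized correctly, not merely translated word by word.
        print(f"SRC: {source}\nTGT: {target}\n")

In such a setup, the printed source/target pairs would still require manual inspection, which is consistent with the paper's observation that human judgment remains necessary to assess robustness and coherence.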

Published
30/09/2024
SOUSA, Jose Leandro; SOUZA, Cristian; HANADA, Raiza; NASCIMENTO, Diogo; COLLINS, Eliane. Generation of test datasets using LLM - Quality Assurance Perspective. In: SIMPÓSIO BRASILEIRO DE ENGENHARIA DE SOFTWARE (SBES), 38., 2024, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 641-647. DOI: https://doi.org/10.5753/sbes.2024.3587.