Will AI also replace inspectors? Investigating the potential of generative AIs in usability inspection

Abstract


Usability inspection is a well-established technique for identifying interaction issues in software interfaces, thereby contributing to improved product quality. However, it is a costly process that demands time and specialized knowledge from inspectors. With advances in Artificial Intelligence (AI), new opportunities have emerged to support this task, particularly through generative models capable of interpreting interfaces and performing inspections more efficiently. This study examines the performance of generative AIs in identifying usability problems, comparing it with that of experienced human inspectors. A software prototype was evaluated by four specialists and two AI models (GPT-4o and Gemini 2.5 Flash), using metrics such as precision, recall, and F1-score. While the inspectors achieved the highest precision and overall coverage, the AIs showed high individual performance and discovered many novel defects, but also produced more false positives and redundant reports. The combination of AIs and human inspectors yielded the best results, revealing their complementarity. These findings suggest that AI, at its current stage, cannot replace human inspectors but can serve as a valuable augmentation tool, improving efficiency and expanding defect coverage. The results provide quantitative evidence to inform the discussion on the role of AI in usability inspections, pointing to viable paths for its complementary use in software quality assessment.
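To make the evaluation metrics concrete, the following is a minimal illustrative sketch (not taken from the paper) of how precision, recall, and F1-score can be computed once an inspector's or AI's reported issues have been matched against a consolidated list of real defects. The function name, defect identifiers, and counts below are hypothetical.

```python
# Illustrative sketch: precision, recall, and F1-score over sets of
# defect identifiers. "reported" is what one inspector or AI flagged;
# "real" is the consolidated reference list of confirmed defects.

def precision_recall_f1(reported, real):
    true_positives = len(reported & real)  # correctly identified defects
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(real) if real else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: an AI reports 5 issues, 3 of which match
# the 6 defects in the reference list (2 are false positives).
ai_reports = {"D1", "D2", "D3", "FP1", "FP2"}
reference = {"D1", "D2", "D3", "D4", "D5", "D6"}
p, r, f = precision_recall_f1(ai_reports, reference)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.6 0.5 0.55
```

Under this formulation, a higher false-positive rate lowers precision without affecting recall, which is consistent with the abstract's observation that the AIs found many defects but also produced more spurious reports.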

Keywords: Usability, Inspection, Generative AI, LLM, Comparison, Experts

Published
2025-11-04
CAMPOS, Luis F. G.; MARQUES, Leonardo C.; NAKAMURA, Walter T. Will AI also replace inspectors? Investigating the potential of generative AIs in usability inspection. In: BRAZILIAN SOFTWARE QUALITY SYMPOSIUM (SBQS), 24., 2025, São José dos Campos/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1-11. DOI: https://doi.org/10.5753/sbqs.2025.15060.