Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports
Abstract
Computer manufacturers offer platforms where users describe device faults in textual reports such as “My screen is flickering”. Identifying the faulty component from the report is essential for automating tests and improving user experience. However, such reports are often ambiguous and lack detail, which makes the task challenging. Large Language Models (LLMs) have shown promise in addressing such issues. This study evaluates 27 open-source models (1B–72B parameters) and 2 proprietary LLMs using four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT+Few-Shot (CoT+FS). We conducted 98,948 inferences, processing over 51 million input tokens and generating 13 million output tokens. We achieve an F1-score of up to 0.76. Results show that three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct and gemma-2-2b-it, which offer competitive performance with lower VRAM usage, enabling efficient inference on end-user devices such as modern laptops or smartphones with NPUs.
References
Abburi, H., Suesserman, M., Pudota, N., Veeramani, B., Bowen, E., and Bhattacharya, S. (2023). Generative ai text classification using ensemble llm approaches. arXiv preprint arXiv:2309.07755.
Almeida, F. C. and Caminha, C. (2024). Evaluation of entry-level open-source large language models for information extraction from digitized documents. In Symposium on Knowledge Discovery, Mining and Learning (KDMiLe), pages 25–32. SBC.
Bastos, Z., Freitas, J. D., Franco, J. W., and Caminha, C. (2025). Prompt-driven time series forecasting with large language models. In Proceedings of the 27th International Conference on Enterprise Information Systems, pages 309–316.
Efron, B. and Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical science, pages 54–75.
Hadi, M. U., Qureshi, R., Shah, A., Irfan, M., Zafar, A., Shaikh, M. B., Akhtar, N., Wu, J., Mirjalili, S., et al. (2023). A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints, 3.
Ishizaka, A. and Nemery, P. (2013). Multi-criteria decision analysis: methods and software. John Wiley & Sons.
Karl, A. L., Fernandes, G. S., Pires, L. A., Serpa, Y. R., and Caminha, C. (2024). Synthetic ai data pipeline for domain-specific speech-to-text solutions. In Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (STIL), pages 37–47. SBC.
Li, Y., He, Y., Lian, R., and Guo, Q. (2023). Fault diagnosis and system maintenance based on large language models and knowledge graphs. In 2023 5th international conference on robotics, intelligent control and artificial intelligence (RICAI), pages 589–592. IEEE.
Lotov, A. V. and Miettinen, K. (2008). Visualizing the pareto frontier. In Multiobjective optimization: interactive and evolutionary approaches, pages 213–243. Springer.
Makram, M. and Mohammed, A. (2024). Ai applications in medical reporting and diagnosis. In 2024 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), pages 185–192. IEEE.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1):141–142.
Nam, D., Macvean, A., Hellendoorn, V., Vasilescu, B., and Myers, B. (2024). Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13.
Nathani, M., Soni, R., and Mishra, R. (2024). Knowledge distillation in mixture of experts for multi-modal medical llms. In 2024 IEEE International Conference on Big Data (BigData), pages 4367–4373. IEEE.
Pereira, F. L. F., Chaves, I. C., Gomes, J. P. P., and Machado, J. C. (2020). Using autoencoders for anomaly detection in hard disk drives. In 2020 international joint conference on neural networks (IJCNN), pages 1–7. IEEE.
Queiroz, L. P., Rodrigues, F. C. M., Gomes, J. P. P., Brito, F. T., Brito, I. C., and Machado, J. C. (2016a). Fault detection in hard disk drives based on mixture of gaussians. In 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), pages 145–150. IEEE.
Queiroz, L. P., Rodrigues, F. C. M., Gomes, J. P. P., Brito, F. T., Chaves, I. C., Paula, M. R. P., Salvador, M. R., and Machado, J. C. (2016b). A fault detection method for hard disk drives based on mixture of gaussians and nonparametric statistics. IEEE Transactions on industrial informatics, 13(2):542–550.
Rasal, S. (2024). Llm harmony: Multi-agent communication for problem solving. arXiv preprint arXiv:2401.01312.
Silva, M. d. L. M., Mendonça, A. L. C., Neto, E. R. D., Chaves, I. C., Brito, F. T., Farias, V. A. E., and Machado, J. C. (2025). Classification of user reports for detection of faulty computer components using nlp models: A case study.
Silva, M. d. L. M., Mendonça, A. L. C., Neto, E. R. D., Chaves, I. C., Caminha, C., Brito, F. T., Farias, V. A. E., and Machado, J. C. (2024). Facto dataset: A dataset of user reports for faulty computer components. In Dataset Showcase Workshop (DSW), pages 91–102. SBC.
Tao, L., Liu, H., Ning, G., Cao, W., Huang, B., and Lu, C. (2025). Llm-based framework for bearing fault diagnosis. Mechanical Systems and Signal Processing, 224:112127.
Wang, L., Bi, W., Zhao, S., Ma, Y., Lv, L., Meng, C., Fu, J., Lv, H., et al. (2024). Investigating the impact of prompt engineering on the performance of large language models for standardizing obstetric diagnosis text: comparative study. JMIR formative research, 8(1):e53216.
Zheng, S., Pan, K., Liu, J., and Chen, Y. (2024). Empirical study on fine-tuning pre-trained large language models for fault diagnosis of complex systems. Reliability Engineering & System Safety, 252:110382.
Published
20/07/2025
How to Cite
CAMINHA, Carlos; SILVA, Maria de Lourdes M.; CHAVES, Iago C.; BRITO, Felipe T.; FARIAS, Victor A. E.; MACHADO, Javam C. Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports. In: SEMINÁRIO INTEGRADO DE SOFTWARE E HARDWARE (SEMISH), 52., 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 287-298. ISSN 2595-6205. DOI: https://doi.org/10.5753/semish.2025.8473.
