Large Language Models for Structured Chest CT Reporting in Portuguese: A Comparative Study with Radiologist Validation

Juliana Petruceli; Marcelo Oliveira; Tarcisio Ferreira; Jose Arthur Sabino

doi:10.5753/sbcas.2026.20861

Juliana Petruceli (UFAL)
Marcelo Oliveira (UFAL)
Tarcisio Ferreira (UFAL)
Jose Arthur Sabino (UFAL)

Abstract

This study evaluates large language models (LLMs) for converting free-text Portuguese chest CT reports into structured JSON to support clinical communication and data reuse. Gemini 1.5 Flash, GPT-4o, and LLaMA 3.3 were tested on 1,102 de-identified reports using a dynamic JSON template with few-shot prompting. Validation combined radiologist review and quantitative metrics. All models produced coherent structured outputs. Gemini achieved the best agreement (macro-F1 0.852; micro-F1 0.853), followed by LLaMA (0.806; 0.809) and GPT-4o (0.797; 0.798). LLM-assisted structuring of Portuguese chest CT reports is feasible and attains high agreement with manual references; section-aware prompting and JSON validation improve robustness.
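
As an illustration of the two quantitative ideas named in the abstract (JSON validation of model output, and macro- versus micro-averaged F1), the following is a minimal Python sketch. It is not the authors' pipeline: the field names in `REQUIRED_FIELDS`, the helper names, and the single-label scoring scheme are all assumptions made for this example, since the abstract does not specify the template or how agreement was scored.

```python
import json
from collections import Counter

# Hypothetical template fields (assumption); the study's actual JSON template is not given here.
REQUIRED_FIELDS = ("lungs", "pleura", "mediastinum")


def parse_structured_report(raw):
    """Return the parsed dict if `raw` is a JSON object with all required fields, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    return data


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no counts at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(reference, predicted):
    """Macro- and micro-averaged F1 for two equal-length lists of single labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(reference, predicted):
        if ref == pred:
            tp[ref] += 1
        else:
            fp[pred] += 1  # predicted label was wrong for this item
            fn[ref] += 1   # reference label was missed for this item
    labels = set(reference) | set(predicted)
    if not labels:
        return 0.0, 0.0
    # Macro: average the per-label F1 scores, weighting every label equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

Under this single-label assumption, micro-F1 equals plain accuracy, while macro-F1 gives rare labels the same weight as common ones. The close macro and micro values reported for each model are consistent with fairly even performance across labels, though the abstract alone does not confirm how the labels were defined.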

