Generation of retinography reports using Contrastive Captioner
Abstract
The automatic generation of retinography reports serves as a medical support tool, enabling eye diseases to be diagnosed faster than with traditional methods, reducing patients' waiting times and contributing to fewer cases of visual impairment. Recent report-generation models propose new methods for integrating visual and textual information, but remain dependent on keywords to produce the clinical descriptions. In this work, we explore a pretrained Contrastive Captioner (CoCa), correlating image and text by combining the model's two loss functions in order to generate retinography reports without relying on keywords. In experiments on the DeepEyeNet dataset, the method achieved a BLEU-4 of 0.230, a CIDEr of 0.517, and a METEOR of 0.423.
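For reference, the combination of loss functions the abstract alludes to is CoCa's joint training objective as defined in the original paper (Yu et al. 2022): a weighted sum of a symmetric image-text contrastive term and an autoregressive captioning term. The weights below are the hyperparameters of that formulation, not values reported in this work:

\[
\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}}\,\mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}}\,\mathcal{L}_{\mathrm{Cap}},
\qquad
\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T} \log P_{\theta}\left(y_t \mid y_{<t}, x\right),
\]

\[
\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\sum_{i=1}^{N}\left(
\log \frac{\exp\!\left(x_i^{\top} y_i / \sigma\right)}{\sum_{j=1}^{N} \exp\!\left(x_i^{\top} y_j / \sigma\right)}
+
\log \frac{\exp\!\left(y_i^{\top} x_i / \sigma\right)}{\sum_{j=1}^{N} \exp\!\left(y_i^{\top} x_j / \sigma\right)}
\right),
\]

where \(x_i\) and \(y_i\) are the normalized image and text embeddings of the \(i\)-th pair in a batch of size \(N\), \(\sigma\) is a learnable temperature, and the captioning term is the standard cross-entropy over the generated token sequence.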
Published
June 9, 2025
How to Cite
PIMENTEL, Patrik O.; ALMEIDA, Mauricio M.; ALMEIDA, João D. S.; LEMOS, Victor H. B. de; MARTINS, Luis Eduardo S. C. Geração de laudos de retinografia utilizando Contrastive Captioner. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 25., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 850-861. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2025.7824.