Generation of retinography reports using Contrastive Captioner
Abstract
The automatic generation of retinography reports serves as a medical support tool, enabling eye diseases to be diagnosed faster than with traditional methods, reducing patients' waiting times and contributing to fewer cases of visual impairment. Recent report-generation models propose new methods for integrating visual and textual information, but remain dependent on keywords to produce the clinical descriptions. In this work, we explore a pretrained Contrastive Captioner (CoCa), correlating image and text by combining the model's two loss functions in order to generate retinography reports without relying on keywords. In experiments on the DeepEyeNet dataset, the method achieved a BLEU-4 of 0.230, a CIDEr of 0.517, and a METEOR of 0.423.
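For reference, the combination of loss functions the abstract alludes to is CoCa's joint training objective as defined in the original paper (Yu et al. 2022): a weighted sum of a symmetric image-text contrastive term and an autoregressive captioning term. The weights below are the hyperparameters of that formulation, not values reported in this work:

\[
\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}}\,\mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}}\,\mathcal{L}_{\mathrm{Cap}},
\qquad
\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T} \log P_{\theta}\left(y_t \mid y_{<t}, x\right),
\]

\[
\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\sum_{i=1}^{N}\left(
\log \frac{\exp\!\left(x_i^{\top} y_i / \sigma\right)}{\sum_{j=1}^{N} \exp\!\left(x_i^{\top} y_j / \sigma\right)}
+
\log \frac{\exp\!\left(y_i^{\top} x_i / \sigma\right)}{\sum_{j=1}^{N} \exp\!\left(y_i^{\top} x_j / \sigma\right)}
\right),
\]

where \(x_i\) and \(y_i\) are the normalized image and text embeddings of the \(i\)-th pair in a batch of size \(N\), \(\sigma\) is a learnable temperature, and the captioning term is the standard cross-entropy over the generated token sequence.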
Published
June 9, 2025
How to Cite
PIMENTEL, Patrik O.; ALMEIDA, Mauricio M.; ALMEIDA, João D. S.; LEMOS, Victor H. B. de; MARTINS, Luis Eduardo S. C. Geração de laudos de retinografia utilizando Contrastive Captioner. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 25., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 850-861. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2025.7824.