Instance Segmentation in Medical Imaging: A Comparative Study of CNN and Transformer-Based Models in a Teledermatology Study-Case
Abstract
The rapid evolution of instance segmentation models necessitates empirical comparisons to guide their adoption in critical domains such as medical imaging. This study evaluates four state-of-the-art architectures (Mask R-CNN, Mask2Former, YOLOv11, and YOLOv12) on a teledermatological dataset annotated for compliance-driven segmentation of rulers and patient information tags. Results demonstrate that the transformer-based and hybrid models (Mask2Former, YOLOv11) significantly outperform traditional CNNs on precision-driven metrics (AP75), highlighting their suitability for medical applications. This work provides actionable insights for model selection in healthcare, emphasizing the balance between accuracy and computational efficiency.
References
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. [link].
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1280–1289.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale.
Esteva, A., Chou, K., Yeung, S., Naik, N., Madani, A., Mottaghi, A., Liu, Y., Topol, E., Dean, J., and Socher, R. (2021). Deep learning-enabled medical computer vision. npj Digital Medicine, 4(1).
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE.
Isensee, F., Jaeger, P., Kohl, S., Petersen, J., and Maier-Hein, K. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18:1–9.
Jocher, G., Qiu, J., and Chaurasia, A. (2024). Ultralytics YOLO 11. [link].
Khanam, R. and Hussain, M. (2024). YOLOv11: An overview of the key architectural enhancements. [link].
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P. (2014). Microsoft COCO: Common Objects in Context.
Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., van der Laak, J. A., van Ginneken, B., and Sánchez, C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88.
Padilla, R., Netto, S. L., and da Silva, E. A. B. (2020). A survey on performance metrics for object-detection algorithms. In 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), pages 237–242.
Ribeiro, R. d. P. e. S. and von Wangenheim, A. (2024). Automated image quality and protocol adherence assessment of examinations in teledermatology: First results. Telemedicine and e-Health, 30(4):994–1005. PMID: 37930716.
Smith, L. N. and Topin, N. (2019). Super-convergence: very fast training of neural networks using large learning rates. In Pham, T., editor, Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, page 36. SPIE.
Tian, Y., Ye, Q., and Doermann, D. (2025). YOLOv12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Published
09/06/2025
How to Cite
RIBEIRO, Rodrigo P. S.; WANGENHEIM, Aldo von. Instance Segmentation in Medical Imaging: A Comparative Study of CNN and Transformer-Based Models in a Teledermatology Study-Case. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 25., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 819-827. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2025.7814.