Classification of Lesions in Capsule Endoscopy Images using Swin Transformer and Semi-Supervised Learning
Abstract
Automated analysis of Wireless Capsule Endoscopy (WCE) images remains a significant challenge in medical imaging, owing to the difficulty of detecting lesions, the scarcity of labeled samples, and the high visual variability of the images. This work proposes a multiclass classification method for luminal findings in WCE images based on the Swin Transformer architecture, structured in two sequential stages: a binary classifier that separates normal from anomalous images, followed by a multiclass classifier that identifies the specific lesion. To address the shortage of labeled data, offline data augmentation and semi-supervised learning techniques were employed. Experiments on the Kvasir-Capsule dataset, with six classes of luminal findings, showed that Transformer-based architectures consistently outperform traditional CNN models such as ResNet-50, MobileNetV3, and EfficientNetV2. The Swin Transformer model achieved 98% accuracy and F1-score in the binary stage and 86% in multiclass classification, a gain of 4 percentage points over purely supervised training.
References
Long Bai, Liangyu Wang, Tong Chen, Yuanhao Zhao, and Hongliang Ren. Transformer-based disease identification for small-scale imbalanced capsule endoscopy dataset. Electronics, 11(17):2747, 2022.
Sabina Beg, Tim Card, Reena Sidhu, Ewa Wronska, Krish Ragunath, Hey-Long Ching, Anastasios Koulaouzidis, Diana Yung, Simon Panter, Mark Mcalindon, et al. The impact of reader fatigue on the accuracy of capsule endoscopy interpretation. Digestive and Liver Disease, 53(8):1028–1033, 2021.
Veysel Yusuf Cambay, Prabal Datta Barua, Abdul Hafeez Baig, Sengul Dogan, Mehmet Baygin, Turker Tuncer, and UR Acharya. Automated detection of gastrointestinal diseases using resnet50*-based explainable deep feature engineering model with endoscopy images. Sensors, 24(23):7710, 2024.
Adrian B Chłopowiec, Adam R Chłopowiec, Krzysztof Galus, Wojciech Cebula, and Martin Tabakov. Local lesion generation is effective for capsule endoscopy image data augmentation in a limited data setting. arXiv preprint arXiv:2411.03098, 2024.
Abhishek Choudhary, Mayur Raj, and Kanishk Kumar. High-performance capsule endoscopy classification using swin transformers, 2024.
P Cortegoso Valdivia, U Deding, T Bjørsum-Meyer, G Baatrup, I Fernández-Urién, X Dray, P Boal-Carvalho, P Ellul, E Toth, E Rondonotti, et al. Inter/intra-observer agreement in video-capsule endoscopy: Are we getting it all wrong? a systematic review and meta-analysis. Diagnostics, 12(10):2400, 2022.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Mateus F de C Ferreira, Paula D Portella, Juliana F de Souza, Bruna C Dias, Luciana R da S Assunção, and Lucas F de Oliveira. Avaliaçao do uso redes neurais convolucionais para identificaçao de lesoes cariosas dentárias. In Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS), pages 473–478. SBC, 2023.
Rodrigo Gounella, Talita Conte Granado, Oswaldo Hideo Ando Junior, Daniel Luís Luporini, Mario Gazziro, and João Paulo Carmo. Endoscope capsules: The present situation and future outlooks. Bioengineering, 10(12):1347, 2023.
Xiaoqing Guo and Yixuan Yuan. Semi-supervised wce image classification with adaptive aggregated attention. Medical Image Analysis, 64:101733, 2020.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019.
Parminder Kaur and Rakesh Kumar. Performance analysis of convolutional neural network architectures over wireless capsule endoscopy dataset. Bulletin of Electrical Engineering and Informatics, 13(1):312–319, 2024.
Sang Hoon Kim and Yun Jeong Lim. Artificial intelligence in capsule endoscopy: A practical guide to its past and future challenges. Diagnostics, 11(9):1722, 2021.
Dongguang Li, David Cave, April Li, and Shaoguang Li. Enhanced accuracy for classification of video capsule endoscopy images using multiple deep learning convolutional neural networks. iGIE, 3(1):72–81, 2024.
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
Guobing Pan and Litong Wang. Swallowable wireless capsule endoscopy: Progress and technical challenges. Gastroenterology research and practice, 2012(1):841691, 2012.
Marco Pennazio, Cristiano Spada, Rami Eliakim, Martin Keuchel, Andrea May, Chris J Mulder, Emanuele Rondonotti, Samuel N Adler, Joerg Albert, Peter Baltes, et al. Small-bowel capsule endoscopy and device-assisted enteroscopy for diagnosis and treatment of small-bowel disorders: European society of gastrointestinal endoscopy (esge) clinical guideline. Endoscopy, 47(04):352–386, 2015.
Smriti Regmi, Aliza Subedi, Ulas Bagci, and Debesh Jha. Vision transformer for efficient chest x-ray and gastrointestinal image classification, 2023.
Rodrigo PS Ribeiro and Aldo von Wangenheim. Instance segmentation in medical imaging: A comparative study of cnn and transformer-based models in a teledermatology study-case. In Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS), pages 819–827. SBC, 2025.
Pia H Smedsrud, Vajira Thambawita, Steven A Hicks, Henrik Gjestang, Oda Olsen Nedrejord, Espen Næss, Hanna Borgli, Debesh Jha, Tor Jan Derek Berstad, Sigrun L Eskeland, et al. Kvasir-capsule, a video capsule endoscopy dataset. Scientific Data, 8(1):142, 2021.
Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
Peter Sugimura and Florian Hartl. Building a reproducible machine learning pipeline. arXiv preprint arXiv:1810.04570, 2018.
Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In International conference on machine learning, pages 10096–10106. PMLR, 2021.
Amy Wang, Subhas Banerjee, Bradley A Barth, Yasser M Bhat, Shailendra Chauhan, Klaus T Gottlieb, Vani Konda, John T Maple, Faris Murad, Patrick R Pfau, et al. Wireless capsule endoscopy. Gastrointestinal endoscopy, 78(6):805–815, 2013.
Published
01/06/2026
How to Cite
OLIVEIRA, Alejandro Costa de; CELLA, Mario Vítor Vieira; QUINTANILHA, Darlan Bruno Pontes; SOARES FILHO, Celso Luiz Silva; CLÍMACO, Francisco Glaubos Nunes; BORCHARTT, Tiago Bonini; PAIVA, Anselmo Cardoso de. Classification of Lesions in Capsule Endoscopy Images using Swin Transformer and Semi-Supervised Learning. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 26., 2026, Ouro Preto/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 205-216. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2026.20610.
