Object Detection in Label-Scarce Scenarios Using Pseudo-Labels Generated by SAM3

João V. D. Sobrinho (UFRJ); Miguel E. M. Campista (UFRJ)

DOI: https://doi.org/10.5753/sbcup.2026.22421

Abstract

The scarcity of labeled data is one of the main challenges in training object detection models. Semi-supervised detection techniques address this problem by exploiting unlabeled data through pseudo-labels, but they can be unstable when applied to one-stage detectors. This work investigates the foundation model Segment Anything Model 3 (SAM3) as an automatic label generator for training object detectors. Experiments indicate that training with SAM3-generated labels can outperform purely supervised training given the same amount of manually labeled data, highlighting the potential of foundation models to reduce dependence on human annotation.
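To make the pseudo-labeling idea concrete, the sketch below shows one way segmentation outputs could be turned into bounding-box labels for a detector. It is an illustration under stated assumptions, not the authors' pipeline: it assumes each prediction arrives as a `(class_id, score, mask)` triple with a binary mask, that low-confidence predictions are dropped with a hypothetical `min_score` threshold of 0.5, and that labels are written in the normalized `class cx cy w h` text format commonly used for YOLO-style training. The function names are hypothetical, and no SAM3 API is called.

```python
from typing import Iterable, List, Optional, Tuple

Mask = List[List[int]]  # binary mask, rows x cols, 1 = object pixel
Box = Tuple[float, float, float, float]  # (cx, cy, w, h), normalized to [0, 1]


def mask_to_yolo_box(mask: Mask) -> Optional[Box]:
    """Return the tight box around the mask, or None if the mask is empty."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(width) if any(mask[r][c] for r in rows)]
    if not rows or not cols:
        return None
    # Pixel-inclusive extents: a single foreground pixel yields a 1x1 box.
    x_min, x_max = cols[0], cols[-1] + 1
    y_min, y_max = rows[0], rows[-1] + 1
    cx = (x_min + x_max) / 2 / width
    cy = (y_min + y_max) / 2 / height
    return cx, cy, (x_max - x_min) / width, (y_max - y_min) / height


def pseudo_label_lines(
    predictions: Iterable[Tuple[int, float, Mask]],
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[str]:
    """Build one label line per prediction that passes the score filter."""
    lines = []
    for class_id, score, mask in predictions:
        if score < min_score:
            continue  # discard low-confidence pseudo-labels
        box = mask_to_yolo_box(mask)
        if box is None:
            continue  # skip empty masks
        lines.append(f"{class_id} " + " ".join(f"{v:.6f}" for v in box))
    return lines
```

Filtering by score before writing labels is one simple way to limit noisy pseudo-labels; the appropriate threshold depends on the dataset and the generator, and would need to be tuned.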
