Dense Recognition for Understanding Ambiguities in Urban Environments for Autonomous Vehicles
Abstract
In the autonomous navigation of mobile robots and vehicles, dense scene understanding, through semantic segmentation and depth estimation, is essential to ensure safety and adaptability in complex urban environments. Despite recent advances in computer vision, such as vision transformers and foundation models, few studies unify dense recognition applied to mobile robots and autonomous vehicles. This work proposes a lightweight, integrated architecture that exploits the complementarity between these two tasks, combining representations extracted from foundation models such as DINOv2 and Depth Anything. Semantic-geometric integration strategies, including depth-map concatenation, weighted spatial attention, and cross-attention, are investigated to increase the robustness of segmentation in ambiguous situations such as reflections, occlusions, and spurious objects. The approach is evaluated on an embedded platform (Jetson AGX Orin), considering quantitative metrics (mIoU, AbsRel, FPS) and qualitative assessments, with a focus on computational feasibility. The expected results indicate that, even with a frozen encoder, training limited to the prediction heads, combined with efficient strategies, can achieve a relevant balance between semantic performance, geometric accuracy, and real-time resource usage.
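To make the simplest of the three integration strategies concrete, the sketch below illustrates how a lightweight segmentation head could fuse frozen backbone features with a monocular depth map by concatenation. This is a minimal PyTorch sketch under assumed shapes and names: the feature dimension (384), patch-grid size, class count (19), and the DepthConcatSegHead module are illustrative only, and dummy tensors stand in for the frozen DINOv2 and Depth Anything outputs; it is not the actual implementation.

```python
# Minimal sketch: depth-map concatenation with frozen encoder features.
# All shapes, channel sizes, and module names are illustrative assumptions.
import torch
import torch.nn as nn

class DepthConcatSegHead(nn.Module):
    """Lightweight head: concatenates patch features with a resized depth map."""
    def __init__(self, feat_dim=384, num_classes=19):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, feats, depth):
        # feats: (B, C, H, W) features from a frozen backbone
        # depth: (B, 1, H0, W0) relative depth map from a frozen depth model
        depth = nn.functional.interpolate(
            depth, size=feats.shape[-2:], mode="bilinear", align_corners=False
        )
        return self.fuse(torch.cat([feats, depth], dim=1))

# Dummy tensors standing in for DINOv2 patch features and a Depth Anything
# prediction (e.g., a 37x37 patch grid for a 518x518 input with ViT-S/14).
feats = torch.randn(1, 384, 37, 37)
depth = torch.rand(1, 1, 518, 518)
logits = DepthConcatSegHead()(feats, depth)
print(logits.shape)  # torch.Size([1, 19, 37, 37])
```

Under this setup, the weighted spatial attention and cross-attention variants would replace the concatenation step with a learned modulation of the feature map by the depth cue, while the encoder stays frozen and only the head parameters are trained.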
