Using the UNet++ Architecture for Monocular Depth Estimation

  • Luiz Antonio Roque Guzzo IFES
  • Kelly Assis de Souza Gazolli IFES


With the emergence of convolutional networks, many approaches have been proposed to improve depth estimation results, but without regard for computational cost. In this work, we present an approach based on the UNet++ architecture that uses a MobileNetV2 network as the encoder, yielding a lighter structure with fewer parameters. Experiments on the NYU Depth V2 dataset showed that it is possible to achieve better results than previous work while keeping a simpler structure.
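To make the nested structure of UNet++ concrete, the following is a minimal sketch of its connectivity only, not the authors' implementation. It assumes the standard UNet++ layout, in which node X(i, j) sits at depth level i and position j along the skip pathway; the first column (j = 0) is the encoder, which in this work would correspond to MobileNetV2 stages. The function name `unetpp_inputs` and the choice of five levels are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (depth level i, position j along the skip pathway)


def unetpp_inputs(depth: int) -> Dict[Node, List[Node]]:
    """Map each UNet++ node X(i, j) to the nodes whose outputs it consumes.

    Hypothetical helper for illustration; it describes the wiring only and
    contains no convolutions or learned weights.
    """
    graph: Dict[Node, List[Node]] = {}
    for j in range(depth):
        for i in range(depth - j):
            if j == 0:
                # Encoder column: each node takes the downsampled output
                # of the encoder node one level above it.
                graph[(i, 0)] = [(i - 1, 0)] if i > 0 else []
            else:
                # Dense skip connections: all earlier nodes on the same level...
                same_level = [(i, k) for k in range(j)]
                # ...plus the upsampled output of the node one level deeper.
                graph[(i, j)] = same_level + [(i + 1, j - 1)]
    return graph


if __name__ == "__main__":
    for node, inputs in sorted(unetpp_inputs(depth=5).items()):
        print(node, "<-", inputs)
```

With five levels, the graph has 15 nodes, and the final node X(0, 4) receives every earlier node on the top level together with the upsampled X(1, 3). These extra intermediate nodes are what distinguish UNet++ from a plain U-Net, where each decoder node receives a single skip connection.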


GUZZO, Luiz Antonio Roque; GAZOLLI, Kelly Assis de Souza. Utilizando a Arquitetura UNet++ na Estimativa de Profundidade Monocular. In: SEMINÁRIO INTEGRADO DE SOFTWARE E HARDWARE (SEMISH), 50., 2023, João Pessoa/PB. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 131-142. ISSN 2595-6205. DOI: https://doi.org/10.5753/semish.2023.229972.