Comparing U-Net based architectures in monocular depth estimation

  • Antônio Carlos Durães da Silva IFES
  • Kelly Assis de Souza Gazolli IFES


Monocular depth estimation is a computer vision problem with diverse applications, ranging from augmented reality to surgical procedures. Given the similarity between the segmentation and monocular depth estimation tasks, and the good performance of the U-Net network and its variations in segmentation, this study compares variants of the U-Net and UNet++ architectures, each adopting a different network as encoder, along with the TransUNet architecture, in monocular depth estimation. The results on the NYU Depth V2 dataset show that U-Net with a Mix Transformer (MiT-B2) encoder outperforms all other evaluated approaches.

Keywords: Monocular depth estimation, U-Net, UNet++, TransUNet


SILVA, Antônio Carlos Durães da; GAZOLLI, Kelly Assis de Souza. Comparing U-Net based architectures in monocular depth estimation. In: WORKSHOP DE VISÃO COMPUTACIONAL (WVC), 18., 2023, São Bernardo do Campo/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 48-53. DOI: