Study of a Self-Supervised Learning Strategy for Improving Temporal Consistency in a Deep Learning-Based Semantic Segmentation Model

  • Felipe M. Barbosa USP
  • Fernando S. Osório USP

Abstract


Deep learning-based semantic segmentation is a task of utmost importance in visual perception for autonomous mobile robots. However, much of the current research explores single-frame perception. This approach not only neglects the possibilities offered by temporal data but also leads to temporally unstable models. In light of that, and considering the high cost of data labeling, new learning alternatives try to leverage widely available unlabeled temporal data. In this work, we therefore study the application of a self-supervised auxiliary supervision strategy to promote temporal stability in semantic segmentation models. The results demonstrate that this strategy improves the model's precision and stability, even when utilizing data from distinct datasets.

Published
2023-08-06
BARBOSA, Felipe M.; OSÓRIO, Fernando S. Estudo de Estratégia de Aprendizado Auto-supervisionado para Aprimoramento da Consistência Temporal em Modelo de Segmentação Semântica Baseado em Deep Learning. In: INTEGRATED SOFTWARE AND HARDWARE SEMINAR (SEMISH), 50., 2023, João Pessoa/PB. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 214-225. ISSN 2595-6205. DOI: https://doi.org/10.5753/semish.2023.230573.