A Study of a Self-Supervised Learning Strategy for Improving Temporal Consistency in a Deep Learning-Based Semantic Segmentation Model

Felipe M. Barbosa (USP); Fernando S. Osório (USP)

DOI: https://doi.org/10.5753/semish.2023.230573

Abstract

Semantic segmentation with Deep Learning is extremely important for visual perception in autonomous mobile robots. However, much of the current research is based on frame-by-frame perception. This approach not only neglects the possibilities offered by temporal data but also yields unstable models. In view of this, and of the high cost of the labeling process, new learning alternatives exploit the wide availability of unlabeled temporal data. This work studies the application of auxiliary self-supervised supervision to promote temporal stability in segmentation models. The results show that this strategy improves both accuracy and stability, even when using data from different datasets.
