LIFT-SLAM: a deep-learning feature-based monocular visual SLAM method

Abstract

The Simultaneous Localization and Mapping (SLAM) problem concerns a robot's ability to localize itself in an unknown environment while simultaneously building a consistent map of that environment. Recently, cameras have been successfully used to capture the environment's features and perform SLAM, an approach referred to as visual SLAM (VSLAM). However, classical VSLAM algorithms are easily induced to fail when the robot's motion or the environment is too challenging. Although new approaches based on Deep Neural Networks (DNNs) have achieved promising results in VSLAM, they are still unable to outperform traditional methods. To leverage the robustness of deep learning to enhance traditional VSLAM systems, we propose to combine the potential of deep-learning-based feature descriptors with traditional geometry-based VSLAM, building a new VSLAM system called LIFT-SLAM. Experiments conducted on the KITTI and EuRoC datasets show that deep learning can be used to improve the performance of traditional VSLAM systems, as the proposed approach achieves results comparable to the state of the art while remaining robust to sensor noise. We further enhance the proposed pipeline with an adaptive approach that avoids dataset-specific parameter tuning, and we evaluate how transfer learning affects the quality of the extracted features.
Keywords: Visual SLAM, Hybrid Methods, Deep Learning
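
To illustrate the core idea, the sketch below shows how a learned detector/descriptor can slot into a classical monocular front-end: features from two frames are matched, and standard epipolar geometry recovers the relative camera pose. This is a minimal illustration under stated assumptions, not the LIFT-SLAM implementation: the extract_features stand-in uses SIFT so the snippet is self-contained, whereas the actual system runs the LIFT network inside a full geometry-based VSLAM pipeline.

import cv2
import numpy as np

def extract_features(image):
    # Stand-in for the learned LIFT detector/descriptor. SIFT is used here
    # only so the example runs out of the box; LIFT-SLAM would instead run
    # the LIFT network to obtain keypoints and descriptors.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    points = np.float32([kp.pt for kp in keypoints])
    return points, descriptors

def relative_pose(img1, img2, K):
    # Detect and describe features in both frames.
    pts1, desc1 = extract_features(img1)
    pts2, desc2 = extract_features(img2)

    # Nearest-neighbour matching with Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc1, desc2, k=2)
    good = [m for m, n in matches if m.distance < 0.8 * n.distance]

    src = np.float32([pts1[m.queryIdx] for m in good])
    dst = np.float32([pts2[m.trainIdx] for m in good])

    # Classical geometric back-end: essential matrix with RANSAC, then
    # decomposition into rotation R and unit-scale translation t (monocular
    # VSLAM recovers translation only up to scale).
    E, inliers = cv2.findEssentialMat(src, dst, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, src, dst, K, mask=inliers)
    return R, t

In practice, relative_pose would be called on consecutive frames with the camera intrinsics matrix K; in a complete SLAM system this front-end feeds tracking, local mapping, and loop closing rather than standing alone.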

Published
11/11/2020
BRUNO, Hudson M. S.; COLOMBINI, Esther L. LIFT-SLAM: a deep-learning feature-based monocular visual SLAM method. In: CONCURSO DE TESES E DISSERTAÇÕES EM ROBÓTICA - CTDR (MESTRADO) - SIMPÓSIO BRASILEIRO DE ROBÓTICA E SIMPÓSIO LATINO-AMERICANO DE ROBÓTICA (SBR/LARS), 8., 2020, Natal. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 49-60. DOI: https://doi.org/10.5753/wtdr_ctdr.2020.14954.