An Initial Study of Bird’s-Eye View Generation for Autonomous Vehicles using Cross-View Transformers

  • Felipe Carlos dos Santos UFSC
  • Eric Aislan Antonelo UFSC
  • Gustavo Claudio Karl Couto UFSC

Abstract


Bird’s-Eye View (BEV) maps provide a structured, top-down abstraction that is crucial for autonomous-driving perception. In this work, we employ Cross-View Transformers (CVT) to learn a mapping from camera images to three BEV channels (road, lane markings, and planned trajectory) using a realistic simulator for urban driving. Our study examines generalization to unseen towns, the effect of different camera layouts, and two loss formulations (focal and L1). Using training data from a single town, a four-camera CVT trained with the L1 loss delivers the most robust performance when evaluated in a new, unseen town. Overall, our results underscore CVT’s promise for mapping camera inputs to reasonably accurate BEV maps.
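To make the comparison of the two objectives concrete, the sketch below contrasts them on a 3-channel BEV prediction (road, lane markings, planned trajectory). It is an illustrative PyTorch implementation, not the authors' training code; the tensor shapes, the per-pixel binary targets, and the hyperparameters alpha and gamma are assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Binary focal loss (Lin et al., 2017), applied per pixel and per channel.
    # logits, targets: (B, 3, H, W); channels = road, lane markings, trajectory.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()   # down-weights easy pixels

def l1_loss(logits, targets):
    # L1 distance between the sigmoid-activated BEV prediction and the target map.
    return torch.abs(torch.sigmoid(logits) - targets).mean()

# Hypothetical usage: a batch of two 200x200 BEV maps with binary targets.
logits = torch.randn(2, 3, 200, 200)
targets = torch.randint(0, 2, (2, 3, 200, 200)).float()
print(focal_loss(logits, targets).item(), l1_loss(logits, targets).item())
```

Focal loss down-weights pixels the model already classifies confidently, which targets the class imbalance between the dense road channel and the much sparser lane-marking and trajectory channels; the plain L1 term penalizes every pixel uniformly, and in this study it was the variant that generalized best to the unseen town.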

References

(2020). Carla autonomous driving leaderboard. [link].

Antonelo, E. A., Couto, G. C. K., Möller, C., and Fernandes, P. H. (2024). Investigating behavior cloning from few demonstrations for autonomous driving based on bird’s-eye view in simulated cities. In BRACIS, pages 155–168. Springer.

Codevilla, F., Müller, M., Dosovitskiy, A., López, A., and Koltun, V. (2018). End-to-end driving via conditional imitation learning. In ICRA.

Couto, G. C. K. and Antonelo, E. A. (2023). Hierarchical generative adversarial imitation learning with mid-level input generation for autonomous driving on urban environments. arXiv preprint arXiv:2302.04823.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Hu, A., Murez, Z., Mohan, N., Dudas, S., Hawke, J., Badrinarayanan, V., Cipolla, R., and Kendall, A. (2021). Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras. In Proceedings of the IEEE/CVF ICCV, pages 15273–15282.

Yang, C., Lin, T., Huang, L., and Crowley, E. J. (2024). Widthformer: Toward efficient transformer-based bev view transformation. arXiv preprint arXiv:2401.03836.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE ICCV, pages 2980–2988.

Müller, J., Schneider, L., and Dengel, A. (2023). Efficient lane transformer for bird’s-eye-view lane detection. In IEEE Intelligent Vehicles Symposium (IV).

Orhan, A. E. and Pitkow, X. (2017). Skip connections eliminate singularities. arXiv preprint arXiv:1701.09175.

Pan, B., Sun, J., Leung, H. Y. T., Andonian, A., and Zhou, B. (2020). Cross-view semantic segmentation for sensing surroundings. IEEE RA-L, 5(3):4867–4873.

Zhou, B. and Krähenbühl, P. (2022). Cross-view transformers for real-time map-view semantic segmentation. In IEEE/CVF CVPR.

Philion, J. and Fidler, S. (2020). Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European Conference on Computer Vision (ECCV).

Qiao, D., Zulkernine, F., and Anand, A. (2024). Cobevfusion: Cooperative perception with lidar-camera bird’s eye view fusion. In 2024 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 389–396. IEEE.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer.

Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114. PMLR.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

Pan, C., He, Y., Peng, J., Zhang, Q., Sui, W., and Zhang, Z. (2023). Baeformer: Bi-directional and early interaction transformers for bird’s eye view semantic segmentation. In IEEE/CVF CVPR.

Wang, T.-H., Manivasagam, S., Liang, M., Yang, B., Zeng, W., and Urtasun, R. (2020). V2vnet: Vehicle-to-vehicle communication for joint perception and prediction. In ECCV, pages 605–621. Springer.

Xu, R., Tu, Z., Xiang, H., Shao, W., Zhou, B., and Ma, J. (2022). Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers. arXiv preprint arXiv:2207.02202.

Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., and Dai, J. (2022). Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV.
Published: 2025-09-29

SANTOS, Felipe Carlos dos; ANTONELO, Eric Aislan; COUTO, Gustavo Claudio Karl. An Initial Study of Bird’s-Eye View Generation for Autonomous Vehicles using Cross-View Transformers. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 855-866. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14251.