Cost-Efficient Visual Perception for Autonomous Vehicles: Leveraging Attention-Based Sensor Fusion to Maintain Performance
Abstract
Autonomous vehicles rely on sophisticated perception systems to ensure safe navigation and decision-making. However, state-of-the-art sensor fusion models often demand extensive computational resources, hindering their deployment on cost-effective hardware. In this work, we address this challenge by modifying the BEVFusion framework to significantly reduce computational costs while maintaining high performance in 3D object detection and segmentation. Specifically, we replace the resource-intensive SwinTransformer backbone with the more efficient ResNet50 and integrate attention-based sensor fusion—leveraging channel and spatial attention mechanisms—to dynamically focus on the most relevant features. This approach reduces VRAM usage from 80 GB to approximately 20 GB, cuts training time from 20 days to 6 days, and boosts inference speed by up to 17.3% on lower-power GPUs. Experimental results on the nuScenes dataset demonstrate a 0.732% improvement in mean Average Precision (mAP) for 3D object detection, along with a 14.12% increase in mean Intersection over Union (mIoU) for semantic segmentation compared to the original BEVFusion model. These improvements underscore the feasibility of deploying advanced visual perception systems on more accessible hardware for real-world autonomous driving and mobile robotics applications.
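To make the attention-based fusion step concrete, below is a minimal PyTorch sketch of a CBAM-style module that fuses camera and LiDAR bird's-eye-view (BEV) feature maps, first reweighting channels and then masking spatial locations. The class name, channel sizes, and reduction ratio are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Hypothetical channel + spatial attention over concatenated BEV features."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: global average pool -> bottleneck MLP -> sigmoid gate
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: channel-wise avg/max maps -> 7x7 conv -> sigmoid mask
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        x = torch.cat([cam_bev, lidar_bev], dim=1)   # fuse along channel axis
        x = x * self.channel_gate(x)                 # emphasize informative channels
        avg_map = x.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)      # (B, 1, H, W)
        mask = self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return x * mask                              # emphasize informative BEV cells


# Fuse 80-channel camera and 80-channel LiDAR BEV maps on a 180x180 grid
fusion = AttentionFusion(channels=160)
out = fusion(torch.randn(1, 80, 180, 180), torch.randn(1, 80, 180, 180))
print(out.shape)  # torch.Size([1, 160, 180, 180])
```

The fused tensor keeps the same spatial resolution as the inputs, so a sketch like this could drop into a BEV pipeline in place of a plain concatenation-plus-convolution fusion block.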
Published
29/09/2025
How to Cite
HONORATO, Eduardo Sperle; BONATO, Vanderlei; WOLF, Denis Fernando. Cost-Efficient Visual Perception for Autonomous Vehicles: Leveraging Attention-Based Sensor Fusion to Maintain Performance. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 285-299. ISSN 2643-6264.
