CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow

Chenbin Pan; Burhaneddin Yaman; Senem Velipasalar; Liu Ren

CVPR 2024

Abstract

Autonomous driving stands as a pivotal domain in computer vision, shaping the future of transportation. Within this paradigm, the backbone of the system plays a crucial role in interpreting the complex environment. However, a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View (BEV) elements. To address this limitation, we introduce CLIP-BEVFormer, a novel approach that leverages the power of contrastive learning techniques to enhance multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically, CLIP-BEVFormer achieves improvements of 8.5% and 9.2% in NDS and mAP, respectively, over the previous best BEV model on the 3D object detection task.
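
The abstract names contrastive learning between image-derived BEV features and ground truth information but does not specify the loss. As a rough illustration only, the sketch below shows a generic CLIP-style symmetric contrastive loss in plain Python. It is not the paper's implementation: the function names, the `temperature` value of 0.07, and the assumption that row `i` of `bev_feats` is the positive pair of row `i` of `gt_feats` are all hypothetical choices made for this example.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(bev_feats, gt_feats, temperature=0.07):
    """Generic CLIP-style loss (assumed form, not taken from the paper).

    bev_feats[i] and gt_feats[i] are treated as a matching pair; every
    other combination in the batch is treated as a negative.
    """
    bev = [l2_normalize(v) for v in bev_feats]
    gt = [l2_normalize(v) for v in gt_feats]
    # Cosine similarity between every BEV embedding and every GT embedding.
    logits = [
        [sum(a * b for a, b in zip(q, k)) / temperature for k in gt]
        for q in bev
    ]
    logits_t = [list(col) for col in zip(*logits)]  # GT-to-BEV direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Hypothetical 2-D embeddings for two objects.
bev = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_contrastive_loss(bev, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_contrastive_loss(bev, [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, the loss is lower when each BEV embedding points in the same direction as its paired ground truth embedding than when the pairs are swapped, which is the behavior a contrastive objective of this kind is meant to encourage.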

Topics

Computer Vision > Analysis > Object Detection

Computer Vision > Domain-Specific > Autonomous Driving

Deep Learning > Learning Types > Contrastive Learning

Keywords

contrastive learning, autonomous driving, bird's eye view, 3D object detection, multi-view image, multi-view camera, bird eye view, bird's eye view detection
