RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception

AI-generated keywords: RoScenes, Bird's Eye View, BEV-to-3D joint annotation pipeline, RoBEV method, 3D object annotations

AI-generated Key Points

  • RoScenes is the largest multi-view roadside perception dataset designed to advance Bird's Eye View (BEV) approaches for complex traffic scenes.
  • It features an expansive perception area, comprehensive scene coverage, and dense traffic scenarios.
  • Contains 21.13 million 3D annotations within a 64,000 $\mathrm{m}^2$ area.
  • Utilizes a novel BEV-to-3D joint annotation pipeline to efficiently gather data while addressing challenges of costly roadside 3D labeling.
  • Current BEV methods evaluated on RoScenes show limitations in handling extensive perception areas and diverse sensor layouts across scenes, leading to subpar performance levels.
  • The proposed RoBEV method uses feature-guided position embedding for effective 2D-3D feature assignment; on the validation set it outperforms existing state-of-the-art methods by a large margin without extra computational overhead.
  • Detailed statistics and analysis cover occlusion levels and camera attributes such as focal length, pitch angle, mounting height, and road coverage.
  • BEV annotations are refined to mitigate perspective distortion and jitter in UAV imagery.

Authors: Xiaosu Zhu, Hualian Sheng, Sijia Cai, Bing Deng, Shaopeng Yang, Qiao Liang, Ken Chen, Lianli Gao, Jingkuan Song, Jieping Ye

Technical report. 32 pages, 21 figures, 13 tables. https://github.com/xiaosu-zhu/RoScenes
License: CC BY-NC-SA 4.0

Abstract: We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird's Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include a significantly large perception area, full scene coverage, and crowded traffic. More specifically, our dataset contains 21.13M 3D annotations within 64,000 $\mathrm{m}^2$. To relieve the expensive cost of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. After that, we organize a comprehensive study of current BEV methods on RoScenes in terms of effectiveness and efficiency. Tested methods suffer from the vast perception area and the variation of sensor layout across scenes, resulting in performance levels falling below expectations. To this end, we propose RoBEV, which incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state of the art by a large margin without extra computational overhead on the validation set. Our dataset and devkit will be made available at https://github.com/xiaosu-zhu/RoScenes.

Submitted to arXiv on 16 May 2024

Results of the summarizing process for the arXiv paper: 2405.09883v1

RoScenes is the largest multi-view roadside perception dataset, designed to advance Bird's Eye View (BEV) approaches for complex traffic scenes. It stands out for its expansive perception area, comprehensive scene coverage, and dense traffic. It provides 21.13 million 3D annotations over a 64,000 $\mathrm{m}^2$ area, collected with a novel BEV-to-3D joint annotation pipeline that reduces the high cost of roadside 3D labeling. An evaluation of current BEV methods on RoScenes shows that they struggle with the extensive perception area and the variation in sensor layout across scenes, and their performance falls below expectations. In response, the authors propose RoBEV, which uses feature-guided position embedding for effective 2D-3D feature assignment. On the validation set, it outperforms existing state-of-the-art methods by a large margin without extra computational overhead. The dataset comes with detailed statistics and analysis covering occlusion levels and camera attributes such as focal length, pitch angle, mounting height, and road coverage. Additionally, BEV annotations are refined to mitigate perspective distortion and jitter in UAV imagery.
Created on 29 Jun. 2026
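
The summary mentions "2D-3D feature assignment" without describing it. As a rough illustration of the geometric step that such assignment generally builds on, the sketch below projects a 3D point into a camera image with a standard pinhole model and checks whether it lands inside the image. This is not RoBEV's feature-guided position embedding, which is not specified here; the function name `project_point`, its parameters, and the example camera values are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def project_point(
    point_world: Sequence[float],
    rotation: Sequence[Sequence[float]],   # assumed 3x3 world-to-camera rotation
    translation: Sequence[float],          # assumed world-to-camera translation
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics
    width: int, height: int,
) -> Optional[Tuple[float, float]]:
    """Project a 3D world point to pixel coordinates (u, v).

    Returns None if the point is behind the camera or falls outside
    the image, i.e. this camera cannot contribute a 2D feature for it.
    """
    # Camera-frame coordinates: X_c = R @ X_w + t
    xc, yc, zc = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:  # behind the image plane
        return None
    # Perspective division followed by the intrinsic mapping
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


# Hypothetical camera: identity pose, 1920x1080 image
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point([1.0, 2.0, 10.0], IDENTITY, [0.0, 0.0, 0.0],
                      1000.0, 1000.0, 960.0, 540.0, 1920, 1080)
print(pixel)
```

With several roadside cameras, a BEV method could run this check per camera for each 3D query location and gather image features only from the views where the point is visible. How those features are then weighted or embedded is where the individual methods differ.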
