STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

AI-generated keywords: 3D reconstruction, STream3R, Transformer, streaming framework, real-time

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors: Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan
  • Approach: Introduces STream3R, a novel approach to 3D reconstruction
  • Innovation: Reformulates pointmap prediction as a decoder-only Transformer problem using causal attention
  • Generalization: Demonstrates strong generalization capabilities across diverse and challenging scenarios by leveraging geometric priors learned from extensive 3D datasets
  • Performance: Outperforms prior approaches in both static and dynamic scene benchmarks
  • Compatibility: Compatible with Large Language Model (LLM)-style training infrastructure for efficient large-scale pretraining and fine-tuning for various downstream 3D tasks
  • Potential: Underscores the potential of causal models for real-time 3D perception in streaming environments
  • Further Details: More information available on the project page at https://nirvanalan.github.io/projects/stream3r

Authors: Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan

TL;DR: Streaming 4D reconstruction using causal transformer. Project page: https://nirvanalan.github.io/projects/stream3r
License: CC BY-NC-ND 4.0

Abstract: We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
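
To make "causal attention" over an image sequence concrete, here is a minimal sketch of an attention mask in which the tokens of each frame may attend only to tokens of the same or earlier frames. This page has access to the paper's metadata only, so the frame-level granularity, the function name frame_causal_mask, and its parameters are assumptions for illustration, not the authors' implementation.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumption: causality is applied per frame, so a token sees every token
    of its own frame and of all earlier frames, and nothing from later frames.
    """
    total = num_frames * tokens_per_frame
    # Index of the frame each flattened token position belongs to.
    frame_of = [t // tokens_per_frame for t in range(total)]
    return [[frame_of[j] <= frame_of[i] for j in range(total)]
            for i in range(total)]


if __name__ == "__main__":
    # Print the mask for 3 frames of 2 tokens each ("x" = allowed, "." = blocked).
    for row in frame_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With one token per frame this reduces to the lower-triangular mask used in decoder-only language models, which is the analogy the abstract draws.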

Submitted to arXiv on 14 Aug. 2025

Results of the summarizing process for the arXiv paper: 2508.10893v1

The paper's license does not allow us to build upon its content, so this summary was generated from the paper's metadata rather than from the full article.

In their paper titled "STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer," authors Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan introduce STream3R, a novel approach to 3D reconstruction. The method reformulates pointmap prediction as a decoder-only Transformer problem. It departs from existing techniques that rely on costly global optimization or on simplistic memory mechanisms that scale poorly with sequence length. The key innovation of STream3R lies in its streaming framework, which efficiently processes image sequences using causal attention inspired by advances in modern language modeling. By leveraging geometric priors learned from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that STream3R outperforms prior approaches on both static and dynamic scene benchmarks. Furthermore, the method's compatibility with Large Language Model (LLM)-style training infrastructure enables efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. The results underscore the potential of causal Transformer models for real-time 3D perception in streaming environments. The authors provide further details on their project page at https://nirvanalan.github.io/projects/stream3r.
Created on 21 Aug. 2025
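
One common way decoder-only models handle streaming input is to cache the keys and values of everything seen so far, so that each new frame is processed once and earlier outputs never need to be recomputed. The sketch below illustrates that general technique in plain Python. It is an assumption about how a streaming causal model can work, not a description taken from the paper: the class StreamingCausalAttention, the single attention head, and the identity query/key/value projections are all hypothetical simplifications.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


class StreamingCausalAttention:
    """Hypothetical single-head layer with a growing key/value cache."""

    def __init__(self):
        self.keys = []    # keys of all tokens seen so far
        self.values = []  # values of all tokens seen so far

    def step(self, frame_tokens):
        """Process one new frame; its tokens attend to this and all earlier frames."""
        # Assumption: identity projections, so each token is its own q, k and v.
        self.keys.extend(frame_tokens)
        self.values.extend(frame_tokens)
        return [attend(tok, self.keys, self.values) for tok in frame_tokens]


if __name__ == "__main__":
    layer = StreamingCausalAttention()
    stream = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
    for index, frame in enumerate(stream):
        outputs = layer.step(frame)
        print(f"frame {index}: cache holds {len(layer.keys)} tokens, "
              f"produced {len(outputs)} outputs")
```

Because each frame only ever attends to the cache, the output for an earlier frame is unaffected by frames that arrive later. The cache itself still grows with the length of the stream in this sketch.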
