StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision
arXiv:2603.29368v1 Announce Type: new
Abstract: Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) is a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers a pronounced degradation of geometric details during feature extraction. This characteristic conflicts with the requirements of binocular stereo vision, thereby constraining its efficacy for related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. A StereoVGGT-based stereo matching network achieves the $1^{st}$ rank among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.
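To make the backbone's role concrete: a stereo matching network typically consumes left/right feature maps from a (here frozen) backbone by building a cost volume that scores, for each pixel and each candidate disparity, how well the left feature matches the horizontally shifted right feature. The sketch below is a minimal NumPy illustration of that standard correlation cost volume; it is not the paper's actual pipeline, and the function name and shapes are assumptions for illustration.

```python
import numpy as np

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Correlation cost volume from left/right backbone features.

    feat_l, feat_r: (C, H, W) feature maps (e.g. from a frozen backbone).
    Returns a (max_disp, H, W) volume where entry [d, y, x] is the
    channel-averaged correlation between left pixel (y, x) and
    right pixel (y, x - d); entries with x < d are left at zero.
    """
    C, H, W = feat_l.shape
    vol = np.zeros((max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        # Shift the right features by d so that right pixel x - d
        # lines up with left pixel x, then correlate over channels.
        corr = (feat_l[:, :, d:] * feat_r[:, :, : W - d]).sum(axis=0) / C
        vol[d, :, d:] = corr
    return vol
```

A per-pixel disparity estimate is then `vol.argmax(axis=0)` (real networks instead regularize the volume with further layers before regression). With unit-normalized features, the correct disparity yields the maximum correlation, which is why geometric detail in the backbone features matters so much for this task.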
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.29368 [cs.CV]
(or arXiv:2603.29368v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.29368
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Ziyang Chen [v1] Tue, 31 Mar 2026 07:41:07 UTC (4,909 KB)