StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision
arXiv:2603.29368v1 Announce Type: new
Abstract: Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) is a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers a pronounced degradation of geometric details during feature extraction. This characteristic conflicts with the requirements of binocular stereo vision, thereby constraining its efficacy for related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. A StereoVGGT-based stereo matching network achieves the $1^{st}$ rank among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.
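To make the backbone's role concrete: a stereo matching network typically consumes left/right feature maps from a (here frozen) backbone by building a cost volume that scores, for each pixel and each candidate disparity, how well the left feature matches the horizontally shifted right feature. The sketch below is a minimal NumPy illustration of that standard correlation cost volume; it is not the paper's actual pipeline, and the function name and shapes are assumptions for illustration.

```python
import numpy as np

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Correlation cost volume from left/right backbone features.

    feat_l, feat_r: (C, H, W) feature maps (e.g. from a frozen backbone).
    Returns a (max_disp, H, W) volume where entry [d, y, x] is the
    channel-averaged correlation between left pixel (y, x) and
    right pixel (y, x - d); entries with x < d are left at zero.
    """
    C, H, W = feat_l.shape
    vol = np.zeros((max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        # Shift the right features by d so that right pixel x - d
        # lines up with left pixel x, then correlate over channels.
        corr = (feat_l[:, :, d:] * feat_r[:, :, : W - d]).sum(axis=0) / C
        vol[d, :, d:] = corr
    return vol
```

A per-pixel disparity estimate is then `vol.argmax(axis=0)` (real networks instead regularize the volume with further layers before regression). With unit-normalized features, the correct disparity yields the maximum correlation, which is why geometric detail in the backbone features matters so much for this task.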
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.29368 [cs.CV]
(or arXiv:2603.29368v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.29368
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Ziyang Chen [v1] Tue, 31 Mar 2026 07:41:07 UTC (4,909 KB)