
StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision

arXiv cs.CV · by Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, Liujuan Cao · April 1, 2026



Abstract: Driven by the advancement of 3D devices, stereo vision tasks, including stereo matching and stereo conversion, have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Because such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints is a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) is a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that applying it directly to stereo vision yields suboptimal performance. We observe that VGGT suffers from a more significant degradation of geometric details during feature extraction. This characteristic conflicts with the requirements of binocular stereo vision and constrains its efficacy for related tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. A StereoVGGT-based stereo matching network ranked $1^{\text{st}}$ among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.
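The abstract does not describe how backbone features are turned into a disparity estimate, so the following is only a generic illustration of what "backbone for stereo matching" usually means, not the StereoVGGT pipeline. It assumes a rectified image pair and uses a standard correlation cost volume with a winner-take-all readout; the function names and the random arrays standing in for frozen-backbone features are hypothetical.

```python
import numpy as np


def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Correlate left features with right features at each candidate disparity.

    left_feat, right_feat: arrays of shape (C, H, W), e.g. the output of any
    frozen feature backbone (an assumption; not the paper's actual features).
    Returns an array of shape (max_disp, H, W); larger values mean better matches.
    """
    _, h, w = left_feat.shape
    volume = np.zeros((max_disp, h, w), dtype=left_feat.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (left_feat * right_feat).mean(axis=0)
        else:
            # In a rectified pair, left pixel x corresponds to right pixel x - d.
            volume[d, :, d:] = (left_feat[:, :, d:] * right_feat[:, :, :-d]).mean(axis=0)
    return volume


def winner_take_all_disparity(volume):
    """Pick, per pixel, the disparity with the highest correlation."""
    return volume.argmax(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for backbone features: the left view is the right
    # view shifted by a known disparity of 3 pixels.
    right = rng.standard_normal((256, 4, 16))
    left = np.roll(right, 3, axis=2)
    disparity = winner_take_all_disparity(correlation_cost_volume(left, right, 8))
    print(disparity[:, 3:])  # columns 0..2 have no valid match at disparity 3
```

In a sketch like this, the quality of the disparity map depends directly on how much fine geometric detail the features retain, which is the property the abstract says degrades in VGGT's feature extraction.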

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.29368 [cs.CV] (or arXiv:2603.29368v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.29368

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ziyang Chen. [v1] Tue, 31 Mar 2026 07:41:07 UTC (4,909 KB)
