
HD-VGGT: High-Resolution Visual Geometry Transformer

arXiv, March 31, 2026

Authors: Tianrun Chen, Yuanqi Hu, Yidong Han, Hanjie Xu, Deyi Ji, Qi Zhu, Chunan Yu, Xin Zhang, Cheng Chen, Chaotao Ding, Ying Zang, Xuanfu Li, Jin Ma, Lanyun Zhu

Abstract: High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
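The scaling problem the abstract describes can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a ViT-style tokenizer with non-overlapping square patches, and the patch size (14), the resolutions (518 and 1036 pixels), and the view count (8) are hypothetical values chosen for the example, not figures taken from the paper. The function names are likewise made up for this illustration.

```python
def token_count(height, width, num_views, patch_size=14):
    """Patch tokens for num_views images, assuming non-overlapping square patches.

    patch_size=14 is an assumed, illustrative value.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return num_views * (height // patch_size) * (width // patch_size)


def attention_pairs(num_tokens):
    """Token pairs scored by one dense self-attention layer over all tokens."""
    return num_tokens * num_tokens


# Hypothetical setup: 8 views, side length doubled from 518 to 1036 pixels.
low = token_count(518, 518, num_views=8)
high = token_count(1036, 1036, num_views=8)

print(low, high, high / low)                          # 10952 43808 4.0
print(attention_pairs(high) / attention_pairs(low))   # 16.0
```

Under these assumptions, doubling the side length quadruples the token count, and a dense attention layer over all tokens then compares 16 times as many pairs. That growth is the cost a coarse low-resolution branch would avoid paying at full resolution.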

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27222 [cs.CV]

(or arXiv:2603.27222v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27222

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Deyi Ji
[v1] Sat, 28 Mar 2026 10:29:07 UTC (3,398 KB)
