Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessDesktop Canary v2.1.48-canary.31LobeChat ReleasesThe Invisible Broken Clock in AI Video GenerationHackernoon AIMean field sequence: an introductionLessWrong AISwift package AI inference engine generated from Rust crateHacker News AI TopZeta-2 Turns Code Edits Into Context-Aware Rewrite SuggestionsHackernoon AIAI Tools That Actually Pay You Back: A Developer's Guide to Monetizing AIDev.to AIThe $6 Million Shockwave: How DeepSeek Just Broke the AI MonopolyMedium AIHow I Got My First Freelance Client in 3 Days (Using AI) — Beginner Guide (India 2026)Medium AIWhy Your Resume Gets Rejected Before a Human Sees It (And How to Fix It)Dev.to AII've Been Saying RAG Is Dead Since 2020Medium AIAI Print-on-Demand Passive Income: ₹400-2K/Design from HomeDev.to AIFaceless YouTube Automation with AI: Complete Guide 2026Dev.to AIBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessDesktop Canary v2.1.48-canary.31LobeChat ReleasesThe Invisible Broken Clock in AI Video GenerationHackernoon AIMean field sequence: an introductionLessWrong AISwift package AI inference engine generated from Rust crateHacker News AI TopZeta-2 Turns Code Edits Into Context-Aware Rewrite SuggestionsHackernoon AIAI Tools That Actually Pay You Back: A Developer's Guide to Monetizing AIDev.to AIThe $6 Million Shockwave: How DeepSeek Just Broke the AI MonopolyMedium AIHow I Got My First Freelance Client in 3 Days (Using AI) — Beginner Guide (India 2026)Medium AIWhy Your Resume Gets Rejected Before a Human Sees It (And How to Fix It)Dev.to AII've Been Saying RAG Is Dead Since 2020Medium AIAI Print-on-Demand Passive Income: ₹400-2K/Design from HomeDev.to AIFaceless YouTube Automation with AI: Complete Guide 2026Dev.to AI
AI NEWS HUBbyEIGENVECTOREigenvector

VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning

arXivMarch 31, 20262 min read1 views
Source Quiz

arXiv:2603.28353v1 Announce Type: new Abstract: Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos, while preserving the spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key — Li-Heng Chen, Ke Cheng, Yahui Liu, Lei Shi, Shi-Sheng Huang, Hongbo Fu

View PDF HTML (experimental)

Abstract:Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos, while preserving the spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into the long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Besides, within the closed-loop generation, we introduce an object-level refinement module to refine the unsatisfied results evaluated from the MV-VLM and then feed them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.28353 [cs.CV]

(or arXiv:2603.28353v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28353

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Li-Heng Chen [view email] [v1] Mon, 30 Mar 2026 12:22:22 UTC (31,206 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Knowledge Map

Knowledge Map
TopicsEntitiesSource
VistaGEN: C…researchpaperarxivcomputer-vi…image-recog…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 145 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers