Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

arXiv, March 31, 2026

Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang

Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.
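The planner/visualizer split described in the abstract can be illustrated with a small control-flow sketch. This is not the authors' implementation: every name here (TextSegment, ImagePlan, Image, generate_interleaved, and the two stub functions) is hypothetical, the planner is assumed to emit its whole plan in one pass, and "visual consistency" is modeled simply by handing previously generated images to the visualizer as references. The stubs stand in for real models so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImagePlan:
    # Dense textual description standing in for an image (a "textual proxy").
    description: str


@dataclass
class Image:
    # Placeholder for pixel data; records only how the image was requested.
    description: str
    num_references: int


PlanItem = Union[TextSegment, ImagePlan]
OutputItem = Union[TextSegment, Image]


def generate_interleaved(
    prompt: str,
    planner: Callable[[str], List[PlanItem]],
    visualizer: Callable[[str, List[Image]], Image],
) -> List[OutputItem]:
    """Plan the text/image sequence, then render each planned image in order.

    Assumption: images generated earlier are passed to the visualizer as
    references, one simple way to express reference-guided generation.
    """
    output: List[OutputItem] = []
    references: List[Image] = []
    for item in planner(prompt):
        if isinstance(item, ImagePlan):
            # Pass a copy so the visualizer cannot mutate our reference list.
            image = visualizer(item.description, list(references))
            references.append(image)
            output.append(image)
        else:
            output.append(item)
    return output


# Hypothetical stubs so the sketch runs without any model.
def stub_planner(prompt: str) -> List[PlanItem]:
    return [
        TextSegment(f"Step 1 for: {prompt}"),
        ImagePlan("a red kettle on a wooden table"),
        TextSegment("Step 2"),
        ImagePlan("the same red kettle, now pouring water"),
    ]


def stub_visualizer(description: str, references: List[Image]) -> Image:
    return Image(description=description, num_references=len(references))


result = generate_interleaved("make tea", stub_planner, stub_visualizer)
```

In this sketch the planner never sees pixels, which mirrors the abstract's point that the planner can be trained on text-only proxy data, while the visualizer only needs a description plus reference images.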

Comments: CVPR 2026 Camera-ready, Webpage: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.25706 [cs.CV] (or arXiv:2603.25706v2 [cs.CV] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.25706 (arXiv-issued DOI via DataCite)

Submission history

From: Jinbo Xing
[v1] Thu, 26 Mar 2026 17:50:37 UTC (21,809 KB)
[v2] Mon, 30 Mar 2026 03:26:27 UTC (21,809 KB)
