Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

[Submitted on 26 Mar 2026]

Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang

Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.
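The planner/visualizer split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names, the `toy_planner` and `toy_visualizer` stand-ins, and the choice to pass all earlier images to the visualizer as references at inference time are assumptions made only to show the control flow of text planning followed by per-image synthesis.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImagePlan:
    # Dense textual description standing in for an image (a "textual proxy").
    description: str


@dataclass
class Image:
    description: str
    references: int  # how many earlier images were supplied as references


Segment = Union[TextSegment, ImagePlan]


def toy_planner(prompt: str) -> List[Segment]:
    """Hypothetical planner: emits text interleaved with image descriptions."""
    return [
        TextSegment(f"Step 1 for: {prompt}"),
        ImagePlan("a red kettle on a wooden table, morning light"),
        TextSegment("Step 2: pour the water."),
        ImagePlan("the same red kettle pouring water into a white cup"),
    ]


def toy_visualizer(description: str, references: List[Image]) -> Image:
    """Hypothetical visualizer: 'renders' a description given earlier images."""
    return Image(description=description, references=len(references))


def generate_interleaved(
    prompt: str,
    planner: Callable[[str], List[Segment]],
    visualizer: Callable[[str, List[Image]], Image],
) -> List[Union[TextSegment, Image]]:
    """Run the planner once, then replace each image plan with an image."""
    output: List[Union[TextSegment, Image]] = []
    history: List[Image] = []  # assumed source of visual consistency
    for segment in planner(prompt):
        if isinstance(segment, ImagePlan):
            image = visualizer(segment.description, list(history))
            history.append(image)
            output.append(image)
        else:
            output.append(segment)
    return output


result = generate_interleaved("making tea", toy_planner, toy_visualizer)
```

In this sketch the planner never sees pixels and the visualizer never sees the surrounding prose, which mirrors the decoupling the abstract describes: each component could in principle be trained on its own data (textual-proxy interleaved data for the planner, reference-guided image data for the visualizer).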

Comments: CVPR 2026 Camera-ready, Webpage: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.25706 [cs.CV]

(or arXiv:2603.25706v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.25706

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jinbo Xing [v1] Thu, 26 Mar 2026 17:50:37 UTC (21,809 KB)

Original source: arXiv, https://arxiv.org/abs/2603.25706v1
