
Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

arXiv, March 31, 2026

arXiv:2603.25706v2 Announce Type: replace

Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang


Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.
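The planner/visualizer split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`toy_planner`, `toy_visualizer`, `generate_interleaved`, the dataclasses) is hypothetical, the "models" are stand-in functions, and passing earlier images to the visualizer as references is an assumption about how reference-guided generation might be wired together.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImagePlan:
    # Dense textual description standing in for an image (a "textual proxy").
    description: str


@dataclass
class Image:
    # Placeholder for generated pixels; a real visualizer would return image data.
    description: str
    reference_count: int


Planned = Union[TextSegment, ImagePlan]
Output = Union[TextSegment, Image]


def toy_planner(prompt: str) -> List[Planned]:
    """Hypothetical planner: emits text interleaved with image descriptions."""
    return [
        TextSegment(f"Step 1 for: {prompt}"),
        ImagePlan(f"Illustration of step 1 for '{prompt}'"),
        TextSegment("Step 2."),
        ImagePlan(f"Illustration of step 2 for '{prompt}', same style as step 1"),
    ]


def toy_visualizer(plan: ImagePlan, references: List[Image]) -> Image:
    """Hypothetical visualizer: renders one description given earlier images."""
    return Image(description=plan.description, reference_count=len(references))


def generate_interleaved(
    prompt: str,
    planner: Callable[[str], List[Planned]],
    visualizer: Callable[[ImagePlan, List[Image]], Image],
) -> List[Output]:
    """Run the planner once, then replace each image plan with a rendered image."""
    outputs: List[Output] = []
    references: List[Image] = []  # assumption: earlier images guide later ones
    for item in planner(prompt):
        if isinstance(item, ImagePlan):
            image = visualizer(item, references)
            references.append(image)
            outputs.append(image)
        else:
            outputs.append(item)
    return outputs


if __name__ == "__main__":
    for part in generate_interleaved("bake bread", toy_planner, toy_visualizer):
        print(part)
```

The point of the sketch is only the interface: the planner works entirely in text, so it could in principle be trained on data where images are replaced by descriptions, while the visualizer sees one description plus reference images at a time.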

Comments: CVPR 2026 Camera-ready, Webpage: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.25706 [cs.CV] (or arXiv:2603.25706v2 [cs.CV] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.25706 (arXiv-issued DOI via DataCite)

Submission history

From: Jinbo Xing
[v1] Thu, 26 Mar 2026 17:50:37 UTC (21,809 KB)
[v2] Mon, 30 Mar 2026 03:26:27 UTC (21,809 KB)

Original source

arXiv

https://arxiv.org/abs/2603.25706

