Customized Visual Storytelling with Unified Multimodal LLMs
arXiv:2603.27690v1 Announce Type: new
Authors: Wei-Hua Li, Cheng Sun, Chu-Song Chen
Abstract: Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.
Comments: Paper accepted to the CVPR 2026 Workshop on Generative AI for Storytelling (CVPRW)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.27690 [cs.CV]
(or arXiv:2603.27690v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.27690
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: WeiHua Li
[v1] Sun, 29 Mar 2026 13:24:51 UTC (11,165 KB)