
Customized Visual Storytelling with Unified Multimodal LLMs

arXiv · March 31, 2026

arXiv:2603.27690v1 (announce type: new)

Authors: Wei-Hua Li, Cheng Sun, Chu-Song Chen

Abstract: Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene consistency, text-visual alignment, and shot-type control. Experiments demonstrate that VstoryGen achieves improved consistency and cinematic diversity compared to existing methods.

Comments: Paper accepted to the CVPR 2026 Workshop on Generative AI for Storytelling (CVPRW)

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27690 [cs.CV]

(or arXiv:2603.27690v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27690

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: WeiHua Li
[v1] Sun, 29 Mar 2026 13:24:51 UTC (11,165 KB)
