
LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

arXiv, March 31, 2026

Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang, Yueting Zhuang


Abstract: Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate generation capacity, we construct LogicTale, a benchmark of richly annotated stories that emphasizes causal reasoning and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)

Cite as: arXiv:2603.28082 [cs.CV]

(or arXiv:2603.28082v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28082

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Chutian Meng
[v1] Mon, 30 Mar 2026 06:37:12 UTC (15,243 KB)

Original source

arXiv

https://arxiv.org/abs/2603.28082

