
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

arXiv · March 31, 2026

arXiv:2603.27259v1 Announce Type: new

Authors: Seng Nam Chen, Hao Chen, Chenglam Ho, Xinyu Mao, Jinping Wang, Yu Zhang, Chao Li


Abstract: Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by +2.50%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
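The abstract describes Scene-RAG only at a high level: a scene memory from which relevant context is retrieved and integrated. As a rough illustration of that general idea, and not the authors' implementation, the sketch below stores one toy embedding per scene, ranks scenes by cosine similarity to a query, and joins the top matches into a context string. The `Scene` class, the `retrieve_scenes` and `build_context` helpers, the hand-written embeddings, and the choice of cosine similarity are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Scene:
    """One coherent video segment with a toy embedding (hypothetical structure)."""
    scene_id: int
    summary: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_scenes(memory: list[Scene], query: list[float], k: int = 2) -> list[Scene]:
    """Return the k scenes most similar to the query, restored to temporal order."""
    ranked = sorted(memory, key=lambda s: cosine(s.embedding, query), reverse=True)
    return sorted(ranked[:k], key=lambda s: s.scene_id)


def build_context(scenes: list[Scene]) -> str:
    """Concatenate retrieved scene summaries into a single context string."""
    return "\n".join(f"[scene {s.scene_id}] {s.summary}" for s in scenes)


# Toy scene memory; in practice the embeddings would come from a real encoder.
memory = [
    Scene(0, "A chef chops vegetables in a kitchen.", [1.0, 0.0, 0.0]),
    Scene(1, "Guests arrive at the restaurant.", [0.0, 1.0, 0.0]),
    Scene(2, "The chef plates the finished dish.", [0.9, 0.1, 0.0]),
]
query_embedding = [1.0, 0.0, 0.0]  # stand-in for an embedded question about the chef
context = build_context(retrieve_scenes(memory, query_embedding, k=2))
print(context)
```

Returning the retrieved scenes in temporal order rather than similarity order is a design choice of this sketch, intended to keep the context readable as a timeline; the paper may handle ordering differently.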

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27259 [cs.CV]

(or arXiv:2603.27259v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27259

arXiv-issued DOI via DataCite (pending registration)

Journal reference: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Submission history

From: Hao Chen Calvin

[v1] Sat, 28 Mar 2026 12:44:19 UTC (15,241 KB)

Original source

arXiv
