
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding

Source: arXiv. Submitted on 13 Dec 2025 (v1), last revised 30 Mar 2026 (this version, v2).
March 31, 2026

arXiv:2512.12360v2 Announce Type: replace-cross

Authors: Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, Zhou Yu


Abstract: Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
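The observe, think, act, memorize loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: every name here (`HierarchicalMemory`, `scan_segments`, `inspect_segment`, `controller`, `run_agent`) is hypothetical, the "video" is a list of caption strings, the controller is a hand-written rule standing in for an MLLM, the two memory levels are an assumption (the abstract only says "multi-level"), and the count of frames viewed is a crude stand-in for token consumption.

```python
from dataclasses import dataclass, field

# Toy "video": each frame is a caption string (a stand-in for real frames).
VIDEO = [
    "street", "street", "street", "street",
    "kitchen", "kitchen", "red mug on table", "kitchen",
    "park", "park", "park", "park",
]
SEGMENT_LEN = 4  # assumed fixed segment length, for illustration only


@dataclass
class HierarchicalMemory:
    """Two illustrative levels: coarse per-segment notes, fine per-frame notes."""
    coarse: dict = field(default_factory=dict)  # segment index -> sampled caption
    fine: dict = field(default_factory=dict)    # frame index -> caption
    frames_viewed: int = 0                      # crude proxy for token cost


def scan_segments(video, memory):
    """Coarse tool: look only at the first frame of every segment."""
    for seg, start in enumerate(range(0, len(video), SEGMENT_LEN)):
        memory.coarse[seg] = video[start]
        memory.frames_viewed += 1


def inspect_segment(video, memory, seg):
    """Fine tool: look at every frame inside one segment."""
    start = seg * SEGMENT_LEN
    for i in range(start, min(start + SEGMENT_LEN, len(video))):
        memory.fine[i] = video[i]
        memory.frames_viewed += 1


def controller(memory, query, inspected):
    """Rule-based stand-in for the MLLM controller; returns (action, argument)."""
    # Answer as soon as a fine-level note mentions the queried object.
    for idx, caption in memory.fine.items():
        if query["object"] in caption:
            return "answer", idx
    if not memory.coarse:
        return "scan", None
    # Prefer un-inspected segments whose coarse note matches the scene hint.
    candidates = [s for s in memory.coarse if s not in inspected]
    if not candidates:
        return "answer", None
    candidates.sort(key=lambda s: query["scene"] not in memory.coarse[s])
    return "inspect", candidates[0]


def run_agent(video, query, max_steps=10):
    memory, inspected = HierarchicalMemory(), set()
    for _ in range(max_steps):
        # Observe the memory and think: the controller picks the next action.
        action, arg = controller(memory, query, inspected)
        if action == "answer":
            return arg, memory
        # Act by invoking a tool; the tool memorizes what it saw.
        if action == "scan":
            scan_segments(video, memory)
        else:
            inspect_segment(video, memory, arg)
            inspected.add(arg)
    return None, memory


answer, memory = run_agent(VIDEO, {"object": "mug", "scene": "kitchen"})
print(answer, memory.frames_viewed)  # prints: 6 7
```

In this toy run the agent locates the mug at frame index 6 after 7 frame views (3 coarse samples plus 4 frames of the kitchen segment) instead of viewing all 12 frames, which mirrors the coarse-to-fine idea only in spirit.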

Comments: Accepted to CVPR 2026, code available at this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Cite as: arXiv:2512.12360 [cs.CV] (or arXiv:2512.12360v2 [cs.CV] for this version)

DOI: https://doi.org/10.48550/arXiv.2512.12360 (arXiv-issued DOI via DataCite)

Submission history

From: Zhou Yu
[v1] Sat, 13 Dec 2025 15:11:03 UTC (8,330 KB)
[v2] Mon, 30 Mar 2026 14:51:27 UTC (5,816 KB)
