VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
Authors: Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, Zhou Yu
Abstract: Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
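The loop of observing, thinking, acting, and memorizing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the memory level names (`video`, `segment`, `frame`), the tool names (`coarse_scan`, `fine_inspect`), the action dictionary format, and the rule-based `toy_controller` are all hypothetical stand-ins chosen for illustration. In the actual system the controller would presumably be an MLLM and the tools would operate on real video.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: (start_sec, end_sec) -> textual observation.
Tool = Callable[[int, int], str]


@dataclass
class HierarchicalMemory:
    """Toy multi-level memory; the three level names are assumptions."""
    levels: Dict[str, List[str]] = field(
        default_factory=lambda: {"video": [], "segment": [], "frame": []}
    )

    def write(self, level: str, clue: str) -> None:
        self.levels[level].append(clue)

    def read(self) -> str:
        # Flatten all clues, coarse levels first, into one context string.
        return "\n".join(
            f"[{level}] {clue}"
            for level, clues in self.levels.items()
            for clue in clues
        )


def run_agent(
    question: str,
    tools: Dict[str, Tool],
    controller: Callable[[str, str], dict],
    max_steps: int = 8,
) -> Tuple[str, HierarchicalMemory]:
    memory = HierarchicalMemory()
    for _ in range(max_steps):
        context = memory.read()                  # observe: current memory state
        action = controller(question, context)   # think: choose the next action
        if action["name"] == "answer":
            return action["text"], memory
        tool = tools[action["name"]]
        observation = tool(action["start"], action["end"])  # act: call a tool
        memory.write(action["level"], observation)          # memorize the clue
    return "unknown", memory


def toy_controller(question: str, context: str) -> dict:
    """Rule-based stand-in for the controller: coarse first, then fine."""
    if "[video]" not in context:
        return {"name": "coarse_scan", "level": "video", "start": 0, "end": 3600}
    if "[frame]" not in context:
        return {"name": "fine_inspect", "level": "frame", "start": 1200, "end": 1260}
    return {"name": "answer", "text": "stub answer"}


# Stub tools that return strings instead of analysing real video.
tools: Dict[str, Tool] = {
    "coarse_scan": lambda s, e: f"summary of {s}-{e}s",
    "fine_inspect": lambda s, e: f"details of {s}-{e}s",
}

answer, memory = run_agent("What happens in the video?", tools, toy_controller)
```

In this sketch only the intervals the controller asks about are ever processed, which is the intuition behind the reduced token consumption claimed in the abstract: a one-hour video is summarized once at a coarse level, and only a 60-second window is inspected in detail.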
Comments: Accepted to CVPR 2026, code available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:2512.12360 [cs.CV]
(or arXiv:2512.12360v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.12360
arXiv-issued DOI via DataCite
Submission history
From: Zhou Yu
[v1] Sat, 13 Dec 2025 15:11:03 UTC (8,330 KB)
[v2] Mon, 30 Mar 2026 14:51:27 UTC (5,816 KB)