AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
arXiv:2508.03100v4 Announce Type: replace
Authors: Yogesh Kulkarni, Pooyan Fazli
Abstract: Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a credit assignment strategy that emphasizes early (planning) and late (synthesis) reasoning phases. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes. Furthermore, it surpasses standard GRPO by +3.7 on OmniBench and +1.9 on Video-Holmes, while demonstrating 5× sample efficiency, requiring 80% fewer generated completions to reach target performance.
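The vanishing advantage problem named in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly used group-relative advantage form, in which each reward is standardized against the mean and standard deviation of its own group; the `eps` constant is an assumption added for numerical stability. The function `u_shaped_weights` is purely hypothetical: it shows one possible way to weight early and late reasoning tokens more heavily, and it is not the TAS formula from the paper, which the abstract does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (GRPO-style).

    eps is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def u_shaped_weights(num_tokens, strength=1.0):
    """Hypothetical per-token weights, NOT the paper's TAS formula.

    Weights are 1 + strength at the first and last token and fall to
    1.0 at the midpoint, following a simple parabola.
    """
    if num_tokens == 1:
        return [1.0 + strength]
    weights = []
    for t in range(num_tokens):
        pos = t / (num_tokens - 1)  # normalized position in [0, 1]
        weights.append(1.0 + strength * (2.0 * pos - 1.0) ** 2)
    return weights


# Identical rewards in a group: every advantage is zero, so no gradient signal.
identical = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Diverse rewards in a group: advantages are non-zero and signed.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Illustrative emphasis on the start and end of a 5-token completion.
weights = u_shaped_weights(5)
```

With identical rewards every entry of `identical` is zero, which is the failure case the abstract attributes to GRPO. Greater reward diversity within a group, which the off-policy reuse of past experiences is said to provide, gives non-zero advantages as in `mixed`.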
Comments: CVPR 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2508.03100 [cs.CV]
(or arXiv:2508.03100v4 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2508.03100
arXiv-issued DOI via DataCite
Submission history
From: Yogesh Kulkarni
[v1] Tue, 5 Aug 2025 05:25:17 UTC (5,550 KB)
[v2] Fri, 21 Nov 2025 00:10:21 UTC (5,678 KB)
[v3] Mon, 24 Nov 2025 03:34:12 UTC (5,678 KB)
[v4] Sun, 29 Mar 2026 04:54:47 UTC (5,655 KB)