
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

arXiv, March 31, 2026

Authors: Yogesh Kulkarni, Pooyan Fazli


Abstract: Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce $\textbf{AVATAR}$ ($\textbf{A}$udio-$\textbf{V}$ideo $\textbf{A}$gen$\textbf{t}$ for $\textbf{A}$lignment and $\textbf{R}$easoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a credit assignment strategy that emphasizes early (planning) and late (synthesis) reasoning phases. $\textbf{AVATAR}$ achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by $\mathbf{+5.4}$ on MMVU, $\mathbf{+4.9}$ on OmniBench, and $\mathbf{+4.5}$ on Video-Holmes. Furthermore, it surpasses standard GRPO by $\mathbf{+3.7}$ on OmniBench and $\mathbf{+1.9}$ on Video-Holmes, while demonstrating $\mathbf{5\times}$ $\textbf{sample efficiency}$, requiring $80\%$ fewer generated completions to reach target performance.
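The vanishing advantage problem and the idea of position-dependent credit can be illustrated with a small sketch. It assumes the commonly used GRPO-style advantage, in which each reward is standardized against the mean and standard deviation of its sampled group. The function `u_shaped_weights`, its `strength` parameter, and the parabolic form are hypothetical stand-ins chosen only to show weighting that favors early and late tokens; the abstract does not give the actual TAS formula, so this is not the authors' method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def u_shaped_weights(num_tokens, strength=1.0):
    """Hypothetical temporal weights: larger at the start and end of a response.

    Illustrative assumption only; this parabola is NOT the paper's TAS formula.
    """
    if num_tokens == 1:
        return [1.0]
    weights = []
    for t in range(num_tokens):
        pos = t / (num_tokens - 1)  # normalized position in [0, 1]
        weights.append(1.0 + strength * (2.0 * pos - 1.0) ** 2)
    return weights


# Every completion in the group earned the same reward: all advantages are zero.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A group with diverse rewards yields a usable, signed learning signal.
mixed_group = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Per-token advantages for one completion under the hypothetical shaping.
weights = u_shaped_weights(5)
shaped = [mixed_group[0] * w for w in weights]

print(uniform_group)
print(mixed_group)
print(weights)
```

In the uniform group, every advantage is zero, so those completions contribute no gradient. The mixed group shows why drawing on past experiences with more diverse rewards can restore the signal.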

Comments: CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2508.03100 [cs.CV]

(or arXiv:2508.03100v4 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2508.03100

arXiv-issued DOI via DataCite

Submission history

From: Yogesh Kulkarni
[v1] Tue, 5 Aug 2025 05:25:17 UTC (5,550 KB)
[v2] Fri, 21 Nov 2025 00:10:21 UTC (5,678 KB)
[v3] Mon, 24 Nov 2025 03:34:12 UTC (5,678 KB)
[v4] Sun, 29 Mar 2026 04:54:47 UTC (5,655 KB)

Original source

arXiv

https://arxiv.org/abs/2508.03100
