EVA: Efficient Reinforcement Learning for End-to-End Video Agent

arXiv · March 30, 2026
arXiv:2603.22918v2 (announce type: replace-cross)

Authors: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu

Abstract: Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline, comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Group Relative Policy Optimization (GRPO), that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods.
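The abstract names GRPO as the final training stage but does not describe the reward design or the exact objective EVA uses. As a rough illustration only, the sketch below shows the group-relative normalization that GRPO is generally known for: several rollouts are sampled for the same query, and each rollout's reward is scored against the mean and standard deviation of its own group. The function name, the example rewards, the `eps` term, and the use of the population standard deviation are all assumptions made for this sketch and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout against its own group (hypothetical helper).

    rewards: scalar rewards for rollouts sampled from the same query.
    eps:     small constant (an assumption) that avoids division by zero
             when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one video question:
# 1.0 = correct answer, 0.0 = incorrect answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where all rollouts score the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline comes from the group itself, no separate value network is needed; rollouts that beat their siblings get positive advantages and the rest get negative ones.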

Comments: CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2603.22918 [cs.CV]

(or arXiv:2603.22918v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.22918

arXiv-issued DOI via DataCite

Submission history

From: Yaolun Zhang
[v1] Tue, 24 Mar 2026 08:06:29 UTC (17,676 KB)
[v2] Thu, 26 Mar 2026 20:03:37 UTC (16,428 KB)
