
AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

arXiv, March 31, 2026

Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys


Abstract: Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: this https URL
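
The pipeline described in the abstract (per-group attention ranking, entropy-based relevance, global budget allocation, and early stopping) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the softmax-over-negative-entropy allocation rule, the `temperature` and `stop_entropy` parameters, and the assumption that lower response entropy means higher relevance are all hypothetical choices made for illustration; in a real system the attention scores and answer probabilities would come from running the MLLM on each group.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(entropies, total_budget, temperature=1.0):
    """Split a token budget across groups.

    Assumption: a lower-entropy (more certain) group gets a larger share,
    using a softmax over negative entropy. Shares are rounded down.
    """
    weights = [math.exp(-h / temperature) for h in entropies]
    norm = sum(weights)
    return [int(total_budget * w / norm) for w in weights]

def top_k_tokens(attention_scores, k):
    """Indices of the k highest-scoring tokens, returned in temporal order."""
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i], reverse=True)
    return sorted(ranked[:k])

def select_tokens(groups, total_budget, stop_entropy=None):
    """Select tokens across groups of a long video.

    groups: list of (attention_scores, answer_probs), one pair per group.
    stop_entropy: hypothetical threshold; if a group's response entropy
    falls to or below it, later groups are skipped (early stopping).
    Returns a dict mapping group index to the selected token indices.
    """
    seen = []
    for idx, (attn, probs) in enumerate(groups):
        h = entropy(probs)
        seen.append((idx, attn, h))
        if stop_entropy is not None and h <= stop_entropy:
            break  # model is certain enough: skip the remaining groups
    budgets = allocate_budget([h for _, _, h in seen], total_budget)
    return {idx: top_k_tokens(attn, min(b, len(attn)))
            for (idx, attn, _), b in zip(seen, budgets)}

# Toy inputs: three groups of four tokens each (made-up numbers).
groups = [
    ([0.1, 0.9, 0.5, 0.3], [0.25, 0.25, 0.25, 0.25]),  # uncertain response
    ([0.7, 0.2, 0.8, 0.1], [0.97, 0.01, 0.01, 0.01]),  # confident response
    ([0.4, 0.4, 0.4, 0.4], [0.5, 0.5, 0.0, 0.0]),      # partly certain
]
full = select_tokens(groups, total_budget=4)
lite = select_tokens(groups, total_budget=4, stop_entropy=0.2)
```

In this toy run, the confident second group receives most of the budget, and the early-stopping variant never processes the third group. Because shares are rounded down, the sketch may leave part of the budget unused; a real allocator would need a policy for the remainder.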

Comments: Project page: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.28696 [cs.CV]

(or arXiv:2603.28696v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28696

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Haozhe Qi
[v1] Mon, 30 Mar 2026 17:14:15 UTC (2,066 KB)
