Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessIntel koopt zijn Ierse 3nm -chipfabriek volledig terug van investeringsbedrijfTweakers.netAlibaba Pushes Deeper Into AI Coding Race - GuruFocusGNews AI Alibabab8629llama.cpp ReleasesAI Slop DetectorHacker News AI TopRambus Unveils HBM4E Controller: 16 GT/s, 2,048-Bit Interface, Enabling C-HBM4EEE TimesGPT reasoning models have "line of sight" to AGI, says OpenAI's Greg Brockman - the-decoder.comGoogle News: OpenAIGPT reasoning models have "line of sight" to AGI, says OpenAI s Greg BrockmanThe DecoderCornell study reveals AI can regenerate famous books with amazing accuracy, sparks copyright concerns - India TodayGNews AI copyrightStudy Finds ChatGPT May Help You Learn Faster, But There's a Catch - ScienceAlertGoogle News: ChatGPTThe Sequence Chat #835: Illia Polosukhin on NEAR AI, Authoring the Transformer Paper and Decentralized and Private AI - TheSequenceGoogle News: Machine LearningOpenClaw Unlocks China’s AI Token Export BusinessBloomberg TechnologySector Snapshot: Venture Funding To Foundational AI Startups In Q1 Was Double All Of 2025Crunchbase NewsBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessIntel koopt zijn Ierse 3nm -chipfabriek volledig terug van investeringsbedrijfTweakers.netAlibaba Pushes Deeper Into AI Coding Race - GuruFocusGNews AI Alibabab8629llama.cpp ReleasesAI Slop DetectorHacker News AI TopRambus Unveils HBM4E Controller: 16 GT/s, 2,048-Bit Interface, Enabling C-HBM4EEE TimesGPT reasoning models have "line of sight" to AGI, says OpenAI's Greg Brockman - the-decoder.comGoogle News: OpenAIGPT reasoning models have "line of sight" to AGI, says OpenAI s Greg BrockmanThe DecoderCornell study reveals AI can regenerate famous books with amazing accuracy, sparks copyright concerns - India TodayGNews AI copyrightStudy Finds ChatGPT May Help You Learn Faster, But There's a Catch - ScienceAlertGoogle News: ChatGPTThe Sequence Chat #835: Illia Polosukhin on NEAR AI, Authoring the Transformer Paper and Decentralized and Private AI - TheSequenceGoogle News: Machine LearningOpenClaw Unlocks China’s AI Token Export BusinessBloomberg TechnologySector Snapshot: Venture Funding To Foundational AI Startups In Q1 Was Double All Of 2025Crunchbase News
AI NEWS HUBbyEIGENVECTOREigenvector

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

arXivMarch 31, 20262 min read0 views
Source Quiz

arXiv:2512.01342v2 Announce Type: replace Abstract: Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics — Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Lang Lin, Ziang Yan, Yali Wang, Yi Wang, Limin Wang

View PDF HTML (experimental)

Abstract:Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, conventional linear decoder in pixel MVM enforces the predictor output latent to be linearly projected to, thus separable in pixel space, causing the conflict with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2512.01342 [cs.CV]

(or arXiv:2512.01342v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2512.01342

arXiv-issued DOI via DataCite

Journal reference: CVPR2026

Submission history

From: Chenting Wang [view email] [v1] Mon, 1 Dec 2025 06:57:39 UTC (1,140 KB) [v2] Mon, 30 Mar 2026 04:56:55 UTC (1,139 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
InternVideo…researchpaperarxivcomputer-vi…image-recog…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 117 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers