
TrackMAE: Video Representation Learning via Track Mask and Predict

arXiv, March 31, 2026

Authors: Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, Bernard Ghanem


Abstract: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines, learning more discriminative and generalizable representations. Code available at this https URL
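To make the idea of motion-aware tube masking concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how trajectories are turned into a mask, so everything below is an assumption for illustration. The function names (`patch_motion_scores`, `motion_aware_tube_mask`), the track format (a list of per-frame `(x, y)` pixel positions, as a point tracker might return), the use of path length as a motion score, and the choice to mask high-motion patches more often are all hypothetical.

```python
import math
import random


def patch_motion_scores(tracks, height, width, patch):
    """Mean track path length per spatial patch (hypothetical motion score).

    tracks: list of tracks; each track is a list of (x, y) pixel positions,
            one per frame (assumed point-tracker output format).
    """
    grid_h, grid_w = height // patch, width // patch
    total = [0.0] * (grid_h * grid_w)
    count = [0] * (grid_h * grid_w)
    for track in tracks:
        # Path length: sum of frame-to-frame displacement magnitudes.
        length = sum(
            math.hypot(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(track, track[1:])
        )
        # Assign the track to the patch containing its first-frame position.
        x, y = track[0]
        col = min(max(int(x // patch), 0), grid_w - 1)
        row = min(max(int(y // patch), 0), grid_h - 1)
        total[row * grid_w + col] += length
        count[row * grid_w + col] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


def motion_aware_tube_mask(scores, mask_ratio, num_frames, rng, eps=1e-6):
    """Sample one spatial mask biased toward high-motion patches and
    repeat it over all frames, so each masked patch forms a tube."""
    num_patches = len(scores)
    num_masked = int(round(mask_ratio * num_patches))
    # Weighted sampling without replacement (Efraimidis-Spirakis, log form):
    # key = log(u) / weight with u in (0, 1]; keep the largest keys.
    keys = [math.log(1.0 - rng.random()) / (s + eps) for s in scores]
    order = sorted(range(num_patches), key=lambda i: keys[i], reverse=True)
    masked = set(order[:num_masked])
    frame_mask = [i in masked for i in range(num_patches)]
    return [list(frame_mask) for _ in range(num_frames)]


# Usage: two tracks on a 32x32 video with 16x16 patches (a 2x2 patch grid).
tracks = [
    [(0, 0), (3, 0), (6, 0), (9, 0)],          # moving point, top-left patch
    [(20, 20), (20, 20), (20, 20), (20, 20)],  # static point, bottom-right patch
]
scores = patch_motion_scores(tracks, height=32, width=32, patch=16)
mask = motion_aware_tube_mask(scores, mask_ratio=0.5, num_frames=4,
                              rng=random.Random(0))
```

In this sketch `mask[t][i]` is `True` when patch `i` is hidden in frame `t`. Because the same spatial mask is repeated across frames, the result keeps the tube structure of random tube masking while biasing which tubes are chosen. Under the same assumptions, the per-patch displacements could also serve as the motion reconstruction targets mentioned in the abstract.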

Comments: Accepted to CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27268 [cs.CV]

(or arXiv:2603.27268v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27268

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Renaud Vandeghen
[v1] Sat, 28 Mar 2026 13:35:23 UTC (4,130 KB)

Original source

arXiv

https://arxiv.org/abs/2603.27268



