
From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning

arXiv, March 30, 2026

Authors: Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao, Qingming Huang


Abstract: Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. Conversely, reducing the number of tunable parameters hinders intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust the representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at this https URL.
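To make the abstract's idea concrete, here is a minimal NumPy sketch of a single trainable projection applied to frozen features and scored with two terms. It is not the authors' implementation. Everything in it is an assumption for illustration: the random arrays stand in for frozen-encoder features, the cycle term is a generic round-trip (random-walk style) formulation, and `separability_penalty` is a hypothetical placeholder, since the abstract does not say what form the semantic separability constraint takes. The names `W`, `tau`, and `lam` are also illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project(feats, W):
    # Lightweight linear projection on top of frozen-encoder features.
    return l2_normalize(feats @ W)

def cycle_consistency_loss(f_a, f_b, W, tau=0.1):
    # Assumed formulation: walk from frame A to frame B and back,
    # then reward each patch for returning to where it started.
    z_a, z_b = project(f_a, W), project(f_b, W)
    a_to_b = softmax(z_a @ z_b.T / tau)   # (N, N) transition A -> B
    b_to_a = softmax(z_b @ z_a.T / tau)   # (N, N) transition B -> A
    round_trip = a_to_b @ b_to_a          # rows still sum to 1
    return float(-np.mean(np.log(np.diag(round_trip) + 1e-12)))

def separability_penalty(video_feats, W):
    # Hypothetical placeholder: mean cosine similarity between the
    # projected embeddings of different videos (lower = more separable).
    z = project(video_feats, W)
    sim = z @ z.T
    off_diag = sim[~np.eye(sim.shape[0], dtype=bool)]
    return float(off_diag.mean())

def total_loss(f_a, f_b, video_feats, W, lam=1.0, tau=0.1):
    # lam trades temporal consistency against separability.
    return (cycle_consistency_loss(f_a, f_b, W, tau)
            + lam * separability_penalty(video_feats, W))

# Toy data standing in for frozen-encoder outputs (assumption).
rng = np.random.default_rng(0)
d = 16
W = np.eye(d)                                # projection starts as identity
f_a = rng.normal(size=(8, d))                # 8 patch features, frame t
f_b = f_a + 0.05 * rng.normal(size=(8, d))   # slightly perturbed frame t+1
video_feats = rng.normal(size=(4, d))        # one pooled feature per video
loss = total_loss(f_a, f_b, video_feats, W)
```

In a real training loop only `W` would receive gradients while the encoder stays frozen, which is what keeps the number of tunable parameters small. An autodiff framework would replace the NumPy arithmetic here.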

Comments: Accepted at CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.26597 [cs.CV] (or arXiv:2603.26597v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.26597

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yang Liu

[v1] Fri, 27 Mar 2026 16:56:50 UTC (16,863 KB)

Original source: arXiv, https://arxiv.org/abs/2603.26597

