
RefAlign: Representation Alignment for Reference-to-Video Generation

arXiv, March 26, 2026

Authors: Lei Wang, YuXin Song, Ge Wu


Abstract: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
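The abstract does not state the loss formula. As a rough illustration only, the pull/push behavior it describes can be sketched with an InfoNCE-style contrastive loss, which is a stand-in chosen here and not necessarily the paper's formulation. The function name `reference_alignment_loss`, the `temperature` parameter, and the assumption that each subject's DiT reference-branch feature has already been pooled and projected to the VFM feature dimension are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def reference_alignment_loss(ref_feats, vfm_feats, temperature=0.1):
    """InfoNCE-style sketch (an assumption, not the paper's exact loss).

    ref_feats: (S, D) array, one pooled DiT reference-branch feature per
               subject, assumed already projected to dimension D.
    vfm_feats: (S, D) array, one pooled VFM feature per subject.
    Row i of both arrays is assumed to describe the same subject.
    """
    r = l2_normalize(np.asarray(ref_feats, dtype=np.float64))
    v = l2_normalize(np.asarray(vfm_feats, dtype=np.float64))

    # Cosine similarity between every reference feature and every VFM feature.
    logits = (r @ v.T) / temperature  # shape (S, S)

    # Numerically stable log-softmax over the VFM axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Diagonal entries are same-subject pairs (pulled together);
    # off-diagonal entries are different subjects (pushed apart).
    return float(-np.mean(np.diag(log_probs)))


# Toy check with three mutually orthogonal "subjects".
vfm = np.eye(3, 8)
loss_aligned = reference_alignment_loss(vfm, vfm)
loss_shuffled = reference_alignment_loss(vfm[[1, 2, 0]], vfm)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is near zero when each reference feature matches its own subject's VFM feature and large when the subjects are permuted. A term like this would be added only to the training objective, so nothing in it runs at sampling time, in line with the abstract's statement that the strategy incurs no inference-time overhead.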

Comments: 17 pages, 11 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.25743 [cs.CV]

(or arXiv:2603.25743v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.25743

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yuxin Song
[v1] Thu, 26 Mar 2026 17:59:57 UTC (11,031 KB)
