
Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion

arXiv, April 3, 2026

Authors: Edoardo A. Dominici, Thomas Deixelberger, Konstantinos Vardis, Markus Steinberger

Abstract: Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing and for aligning and training video diffusion models, but not as a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, such as DINO, contain a lot of entangled information about the style, lighting, and semantics of the scene. This makes them well suited to reconstruction tasks but limits their generative capabilities. In this paper, we show how these features can be used for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from the other features we wish to preserve, enabling robust control over appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated for by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.

Comments: project page this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2604.01761 [cs.CV] (or arXiv:2604.01761v1 [cs.CV] for this version)

DOI: https://doi.org/10.48550/arXiv.2604.01761 (arXiv-issued DOI via DataCite, pending registration)

Submission history

From: Edoardo Alberto Dominici
[v1] Thu, 2 Apr 2026 08:27:48 UTC (48,666 KB)
