Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessSingle-cell imaging and machine learning reveal hidden coordination in algae's response to light stress - MSNGoogle News: Machine LearningGoogle Dramatically Upgrades Storage in Google AI Pro - Thurrott.comGoogle News: GeminiOpenAI Won't Save ARKK (BATS:ARKK) - seekingalpha.comGoogle News: OpenAIAI Can Describe Human Experiences But Lacks Experience In An Actual ‘Body’ - eurasiareview.comGoogle News: AIAI & Digital Tools on Construction Projects: Contract Risks to Address Before Peak Season - JD SupraGoogle News: AIAI Revolution: Action & Insight - businesstravelexecutive.comGoogle News: Machine LearningNavigating the Challenges of Cross-functional Teams: the Role of Governance and Common GoalsDEV Community[Side B] Pursuing OSS Quality Assurance with AI: Achieving 369 Tests, 97% Coverage, and GIL-Free CompatibilityDEV Community[Side A] Completely Defending Python from OOM Kills: The BytesIO Trap and D-MemFS 'Hard Quota' Design PhilosophyDEV CommunityFrom Attention Economy to Thinking Economy: The AI ChallengeDEV CommunityHow We're Approaching a County-Level Education Data System EngagementDEV CommunityI Built a Portable Text Editor for Windows — One .exe File, No Installation, Forever FreeDEV CommunityBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessSingle-cell imaging and machine learning reveal hidden coordination in algae's response to light stress - MSNGoogle News: Machine LearningGoogle Dramatically Upgrades Storage in Google AI Pro - Thurrott.comGoogle News: GeminiOpenAI Won't Save ARKK (BATS:ARKK) - seekingalpha.comGoogle News: OpenAIAI Can Describe Human Experiences But Lacks Experience In An Actual ‘Body’ - eurasiareview.comGoogle News: AIAI & Digital Tools on Construction Projects: Contract Risks to Address Before Peak Season - JD SupraGoogle News: AIAI Revolution: Action & Insight - businesstravelexecutive.comGoogle News: Machine LearningNavigating the Challenges of Cross-functional Teams: the Role of Governance and Common GoalsDEV Community[Side B] Pursuing OSS Quality Assurance with AI: Achieving 369 Tests, 97% Coverage, and GIL-Free CompatibilityDEV Community[Side A] Completely Defending Python from OOM Kills: The BytesIO Trap and D-MemFS 'Hard Quota' Design PhilosophyDEV CommunityFrom Attention Economy to Thinking Economy: The AI ChallengeDEV CommunityHow We're Approaching a County-Level Education Data System EngagementDEV CommunityI Built a Portable Text Editor for Windows — One .exe File, No Installation, Forever FreeDEV Community

DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

arXivMarch 30, 202610 min read0 views
Source Quiz

arXiv:2603.14267v3 Announce Type: replace-cross Abstract: Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-drive — Ngoc-Son Nguyen, Thanh V. T. Tran, Jeongsoo Choi, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

View PDF HTML (experimental)

Abstract:Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.

Comments: Accepted at CVPR 2026 Findings

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)

Cite as: arXiv:2603.14267 [cs.CV]

(or arXiv:2603.14267v3 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.14267

arXiv-issued DOI via DataCite

Submission history

From: Ngoc Son Nguyen [view email] [v1] Sun, 15 Mar 2026 07:53:23 UTC (8,690 KB) [v2] Tue, 17 Mar 2026 05:01:44 UTC (8,684 KB) [v3] Fri, 27 Mar 2026 07:22:39 UTC (8,684 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by AI News Hub · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Knowledge Map

Knowledge Map
TopicsEntitiesSource
DiFlowDubbe…researchpaperarxivaiartificial-…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 175 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers