
LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space

arXiv eess.AS — Submitted on 31 Mar 2026 (published April 1, 2026)
🧒 Explain Like I'm 5 (simple language)

Hey there, little explorer! Guess what? Scientists made a super cool new toy!

Imagine you have a magic crayon that can draw sounds. This new toy, called LongCat-AudioDiT, is like that magic crayon for computers!

Before, computers used to draw sounds in a tricky way, like drawing a picture of a sound first. But LongCat-AudioDiT is smarter! It draws the sound right away, like drawing a real sound wave!

This makes the computer talk much, much better, almost like a real person! It's super good at making voices sound just like other people's voices, even if it's never heard them before. How cool is that?!



Abstract: We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
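The abstract states that the paper replaces traditional classifier-free guidance (CFG) during inference; the details of its adaptive projection guidance are not given here. As a point of reference, the baseline being replaced can be sketched as follows. This is a minimal toy illustration of standard CFG applied in a waveform-latent space, not the paper's method: the denoiser, latent shapes, and guidance scale are all illustrative assumptions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the text-conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy waveform-latent tensor; shapes are illustrative, not the paper's.
rng = np.random.default_rng(0)
latent = rng.standard_normal((1, 8, 16))  # (batch, channels, frames)

def fake_denoiser(x, cond):
    # Stand-in for the diffusion backbone; a real model conditions on text.
    return 0.1 * x if cond is None else 0.1 * x + 0.05

# Simplified guided denoising loop: two forward passes per step,
# one unconditional and one conditional, combined via CFG.
for _ in range(4):
    eps_u = fake_denoiser(latent, cond=None)
    eps_c = fake_denoiser(latent, cond="text")
    latent = latent - cfg_combine(eps_u, eps_c, guidance_scale=2.0)
```

At `guidance_scale=1.0` this reduces to the conditional prediction alone; larger scales push samples toward the condition at some cost to naturalness, which is the trade-off guidance variants like the paper's adaptive projection guidance aim to improve.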

Comments: Code and model weights are available at this https URL

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Cite as: arXiv:2603.29339 [cs.SD]

(or arXiv:2603.29339v1 [cs.SD] for this version)

https://doi.org/10.48550/arXiv.2603.29339

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Detai Xin [v1] Tue, 31 Mar 2026 07:07:52 UTC (450 KB)
