LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
In brief: LongCat-AudioDiT is a diffusion-based text-to-speech model that generates speech directly in a waveform latent space, instead of first predicting an intermediate representation such as a mel-spectrogram and then converting it to audio. Skipping that intermediate stage simplifies the pipeline and reduces compounding errors, and the model delivers strong zero-shot voice cloning, reproducing voices it has never heard during training.
arXiv:2603.29339v1 Announce Type: cross
Abstract: We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
Comments: Code and model weights are available at this https URL
Subjects:
Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2603.29339 [cs.SD]
(or arXiv:2603.29339v1 [cs.SD] for this version)
https://doi.org/10.48550/arXiv.2603.29339
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Detai Xin [view email] [v1] Tue, 31 Mar 2026 07:07:52 UTC (450 KB)
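The abstract states that adaptive projection guidance replaces traditional classifier-free guidance (CFG) at inference time, but gives no formula for either. For context, here is a minimal sketch of the standard CFG combination being replaced: the model's conditional prediction is extrapolated away from its unconditional prediction by a guidance scale. The function name `cfg_combine` and the toy vectors are illustrative stand-ins, not from the paper, and the details of adaptive projection guidance itself are not reproduced here.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance (not the paper's APG):
    e = e_uncond + w * (e_cond - e_uncond).
    w = 1 recovers the conditional prediction; w > 1 amplifies
    the conditioning signal (and, in practice, its artifacts).
    """
    return [eu + w * (ec - eu) for eu, ec in zip(eps_uncond, eps_cond)]

# Toy predictions for a 4-dimensional latent frame (illustrative values).
eps_u = [0.0, 0.0, 0.0, 0.0]
eps_c = [1.0, -1.0, 0.5, 0.0]

guided = cfg_combine(eps_u, eps_c, w=2.0)
# -> [2.0, -2.0, 1.0, 0.0]: twice the conditional offset from the
# unconditional baseline.
```

Guidance-replacement methods such as the paper's adaptive projection guidance typically target the over-saturation and quality degradation that large `w` causes in this linear extrapolation; the abstract does not specify the projection used.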