LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
In brief: LongCat-AudioDiT is a diffusion-based text-to-speech model that generates speech directly in a waveform latent space, instead of first predicting an intermediate representation such as a mel-spectrogram and then converting it to audio. Skipping that intermediate stage simplifies the pipeline and reduces compounding errors, and the model delivers strong zero-shot voice cloning, reproducing voices it has never heard during training.
arXiv:2603.29339v1 Announce Type: cross
Abstract: We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
Comments: Code and model weights are available at this https URL
Subjects:
Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2603.29339 [cs.SD]
(or arXiv:2603.29339v1 [cs.SD] for this version)
https://doi.org/10.48550/arXiv.2603.29339
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Detai Xin [view email] [v1] Tue, 31 Mar 2026 07:07:52 UTC (450 KB)
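The abstract states that adaptive projection guidance replaces traditional classifier-free guidance (CFG) at inference time, but gives no formula for either. For context, here is a minimal sketch of the standard CFG combination being replaced: the model's conditional prediction is extrapolated away from its unconditional prediction by a guidance scale. The function name `cfg_combine` and the toy vectors are illustrative stand-ins, not from the paper, and the details of adaptive projection guidance itself are not reproduced here.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance (not the paper's APG):
    e = e_uncond + w * (e_cond - e_uncond).
    w = 1 recovers the conditional prediction; w > 1 amplifies
    the conditioning signal (and, in practice, its artifacts).
    """
    return [eu + w * (ec - eu) for eu, ec in zip(eps_uncond, eps_cond)]

# Toy predictions for a 4-dimensional latent frame (illustrative values).
eps_u = [0.0, 0.0, 0.0, 0.0]
eps_c = [1.0, -1.0, 0.5, 0.0]

guided = cfg_combine(eps_u, eps_c, w=2.0)
# -> [2.0, -2.0, 1.0, 0.0]: twice the conditional offset from the
# unconditional baseline.
```

Guidance-replacement methods such as the paper's adaptive projection guidance typically target the over-saturation and quality degradation that large `w` causes in this linear extrapolation; the abstract does not specify the projection used.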