🔥 microsoft/VibeVoice
Open-Source Frontier Voice AI
📰 News
2026-03-29: 🎉 VibeVoice-ASR is being adopted by the open-source community! Vibing, a voice-powered input method, is now built on top of VibeVoice-ASR. Download: macOS | Windows
2026-03-06: 🚀 VibeVoice ASR is now part of a Transformers release! You can now use our speech recognition model directly through the Hugging Face Transformers library for seamless integration into your projects.
2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in Playground.
- ⭐️ VibeVoice-ASR is natively multilingual, supporting over 50 languages — check the supported languages for details.
- 🔥 The VibeVoice-ASR finetuning code is now available!
- ⚡️ vLLM is now supported for faster inference; see vllm-asr for more details.
- 📑 The VibeVoice-ASR Technical Report is available.
2025-12-16: 📣 We added experimental speakers to VibeVoice‑Realtime‑0.5B for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. Try it. More speaker types will be added over time.
2025-12-03: 📣 We open-sourced VibeVoice‑Realtime‑0.5B, a real‑time text‑to‑speech model that supports streaming text input and robust long-form speech generation. Try it on Colab.
2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.
2025-08-25: 📣 We open-sourced VibeVoice-TTS, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers. — accepted as an Oral at ICLR 2026! 🔥
Overview
VibeVoice is a family of open-source frontier voice AI models that includes both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models.
A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
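To see why a 7.5 Hz frame rate matters for long sequences, here is a back-of-the-envelope calculation (plain Python, not code from the repo; the 50 Hz comparison rate is an illustrative figure for a conventional audio codec):

```python
# Tokenizer frames produced for long-form audio at a given frame rate.
FRAME_RATE_HZ = 7.5  # VibeVoice's continuous speech tokenizers

def frames_for(minutes: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames for `minutes` of audio."""
    return int(minutes * 60 * rate_hz)

# A full hour of audio stays well inside a 64K-token context window:
print(frames_for(60))       # 27000 frames for 60 minutes of audio
print(frames_for(90))       # 40500 frames for 90 minutes (TTS long-form)

# By contrast, a hypothetical 50 Hz codec would need far more frames:
print(frames_for(60, 50))   # 180000 frames for the same hour
```

At 7.5 Hz, an hour of audio maps to 27,000 frames per tokenizer stream, which is what makes single-pass hour-long processing feasible within the model's context length.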
For more information, demos, and examples, please visit our Project Page.
| Model | Weights | Quick Try |
| --- | --- | --- |
| VibeVoice-ASR-7B | HF Link | Playground |
| VibeVoice-TTS-1.5B | HF Link | Disabled |
| VibeVoice-Realtime-0.5B | HF Link | Colab |
Models
1. 📖 VibeVoice-ASR - Long-form Speech Recognition
VibeVoice-ASR is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for Customized Hotwords.
- 🕒 60-minute Single-Pass Processing: Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to 60 minutes of continuous audio input within a 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
- 👤 Customized Hotwords: Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.
- 📝 Rich Transcription (Who, When, What): The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates who said what and when.
📖 Documentation | 🤗 Hugging Face | 🎮 Playground | 🛠️ Finetuning | 📊 Paper
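The README does not pin down the exact transcript serialization, so as an illustration only, suppose each output line looks like `[00:00.00 - 00:03.20] Speaker 1: text` (a hypothetical format; see the VibeVoice-ASR documentation for the real schema). A few lines of Python can then turn Who/When/What lines into structured records:

```python
import re
from typing import NamedTuple

class Segment(NamedTuple):
    start: str
    end: str
    speaker: str
    text: str

# Hypothetical line format: "[MM:SS.ms - MM:SS.ms] Speaker N: text"
# (illustrative only, not the documented VibeVoice-ASR output schema)
LINE_RE = re.compile(r"\[(\S+) - (\S+)\] (Speaker \d+): (.*)")

def parse_transcript(raw: str) -> list[Segment]:
    """Parse Who/When/What transcript lines into structured segments."""
    segments = []
    for line in raw.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            segments.append(Segment(*m.groups()))
    return segments

demo = """\
[00:00.00 - 00:03.20] Speaker 1: Welcome to the show.
[00:03.40 - 00:07.10] Speaker 2: Thanks for having me.
"""
for seg in parse_transcript(demo):
    print(seg.speaker, seg.start, "->", seg.end, ":", seg.text)
```

The point is the output shape: each segment carries speaker identity, start/end timestamps, and content, which is what "Rich Transcription" produces in a single pass.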
2. 🎙️ VibeVoice-TTS - Long-form Multi-speaker TTS
Best for: Long-form conversational audio, podcasts, multi-speaker dialogues
- ⏱️ 90-minute Long-form Generation: Synthesizes conversational/single-speaker speech up to 90 minutes in a single pass, maintaining speaker consistency and semantic coherence throughout.
- 👥 Multi-speaker Support: Supports up to 4 distinct speakers in a single conversation, with natural turn-taking and speaker consistency across long dialogues.
- 🎭 Expressive Speech: Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.
- 🌐 Multi-lingual Support: Supports English, Chinese, and other languages.
📖 Documentation | 🤗 Hugging Face | 📊 Paper
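A multi-speaker TTS model of this kind consumes a script of labeled turns. The `Speaker N:` convention below is illustrative (consult the VibeVoice-TTS documentation for the exact input format); the sketch just shows how a dialogue script decomposes into the turns a model would voice:

```python
# Split a multi-speaker script into (speaker, line) turns, the shape a
# long-form multi-speaker TTS model consumes. The "Speaker N:" labeling
# is an assumed convention for illustration, not the documented format.
def split_turns(script: str) -> list[tuple[str, str]]:
    turns = []
    for line in script.strip().splitlines():
        speaker, _, text = line.partition(":")
        if text:
            turns.append((speaker.strip(), text.strip()))
    return turns

script = """\
Speaker 1: Did you catch the launch yesterday?
Speaker 2: I did, the long-form demo was impressive.
Speaker 1: Ninety minutes in one pass is wild.
"""
turns = split_turns(script)
assert len({s for s, _ in turns}) <= 4  # the model supports up to 4 speakers
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

Keeping the whole script in one pass, rather than synthesizing each turn separately, is what lets the model maintain voice consistency and natural turn-taking across a long dialogue.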
Demo samples:
- English: ES_._3.mp4
- Chinese: default.mp4
- Cross-Lingual: 1p_EN2CH.mp4
- Spontaneous Singing: 2p_see_u_again.mp4
- Long Conversation with 4 people: 4p_climate_45min.mp4
3. ⚡ VibeVoice-Realtime - Real-time Streaming TTS
VibeVoice-Realtime is a lightweight real‑time text-to-speech model supporting streaming text input and robust long-form speech generation.
- Parameter size: 0.5B (deployment-friendly)
- Real-time TTS with ~300 ms latency to first audible audio
- Streaming text input
- Robust long-form speech generation (~10 minutes)
📖 Documentation | 🤗 Hugging Face | 🚀 Colab
Contributing
Please see CONTRIBUTING.md for detailed contribution guidelines.
⚠️ Risks and Limitations
While efforts have been made to optimize VibeVoice through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions of its base model (specifically, Qwen2.5-1.5B in this release).

Potential for deepfakes and disinformation: high-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must verify that transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and deploy the models lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.