Show HN: SpeechSDK – free, open-source SDK that unifies all AI voice models
Article URL: https://www.speechsdk.dev/ Comments URL: https://news.ycombinator.com/item?id=47633441 Points: 3 # Comments: 0
The Unified Text-to-Speech SDK
The SpeechSDK is a free, open-source toolkit for building AI audio applications with multiple voice providers.
12 Providers · 25+ Models · Built for Production · Open Source (MIT License)
Multi-Provider
One interface across OpenAI, ElevenLabs, Deepgram, Cartesia, Google, Mistral, Hume, and more. Unified model strings, consistent response format, BYO API keys.
Cross-Platform
Runs everywhere — Node.js, Edge runtimes, and the browser. Same API, zero platform-specific code.
Minimal Dependencies
Lightweight by design. Built-in retries, typed errors, and lazy base64 encoding. No heavy frameworks.
AI Engineering
For Production Voice Applications
Lazy base64 conversion
Only computes the format you access — uint8Array or base64 — and caches it. No unnecessary encoding or wasted memory.
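The caching pattern described above can be sketched as follows. This is an illustrative sketch, not the SpeechSDK source; the class and property names are assumptions based on the description (`uint8Array`, `base64`).

```typescript
// Hypothetical sketch of lazy, cached base64 conversion: the raw bytes are
// returned as-is, and base64 is only computed on first access, then cached.
class AudioData {
  #bytes: Uint8Array;
  #base64?: string;

  constructor(bytes: Uint8Array) {
    this.#bytes = bytes;
  }

  get uint8Array(): Uint8Array {
    return this.#bytes; // no conversion needed
  }

  get base64(): string {
    // Encoded on first access, cached for every access after that.
    this.#base64 ??= Buffer.from(this.#bytes).toString("base64");
    return this.#base64;
  }
}

const audio = new AudioData(new Uint8Array([72, 105])); // bytes for "Hi"
console.log(audio.base64); // "SGk="
```

If your code only ever touches `uint8Array`, the base64 string is never built, so no memory is spent on an encoding you don't use.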
Content-type awareness
The mediaType is read directly from each provider's response headers. You always know the actual audio format — MP3 from OpenAI, WAV from Cartesia, etc.
Custom fetch & Base URL
Every provider accepts a custom fetch and baseURL. Point at OpenAI-compatible proxies, Azure OpenAI, LiteLLM, or local models. Swap in undici, a proxy-aware fetch, or a mock.
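The pattern behind the two features above (header-derived `mediaType`, injectable `fetch` and `baseURL`) can be sketched like this. The option names and the `synthesize` function are illustrative assumptions, not the SDK's real API; the point is the shape of the technique.

```typescript
// Sketch only: inject a custom fetch and baseURL so requests can be routed
// to a proxy, an OpenAI-compatible server, or a mock in tests.
type FetchLike = (input: string, init?: RequestInit) => Promise<Response>;

interface ProviderOptions {
  baseURL?: string;
  fetch?: FetchLike;
}

async function synthesize(text: string, opts: ProviderOptions = {}) {
  const baseURL = opts.baseURL ?? "https://api.example-provider.com/v1"; // placeholder URL
  const doFetch = opts.fetch ?? fetch;
  const res = await doFetch(`${baseURL}/audio/speech`, {
    method: "POST",
    body: JSON.stringify({ input: text }),
  });
  // mediaType comes straight from the response headers, so the caller
  // always knows the real audio format the provider returned.
  const mediaType = res.headers.get("content-type") ?? "application/octet-stream";
  return { mediaType, bytes: new Uint8Array(await res.arrayBuffer()) };
}

// Usage with a mock fetch -- no network, no API key:
const mockFetch: FetchLike = async () =>
  new Response(new Uint8Array([1, 2, 3]), {
    headers: { "content-type": "audio/mpeg" },
  });
const out = await synthesize("hello", {
  baseURL: "http://localhost:9999",
  fetch: mockFetch,
});
console.log(out.mediaType); // "audio/mpeg"
```

Because the transport is just a `fetch`-shaped function, the same code works against a local model server, an Azure-hosted endpoint, or a unit-test mock.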
Smart retries
Built-in retry with exponential backoff via p-retry. Retries 5xx and network errors automatically. 4xx errors (auth failures, bad requests) abort immediately — no wasted time.
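The retry policy above can be sketched by hand (the SDK itself delegates to p-retry; this standalone version just mirrors the described behavior, and the names are illustrative):

```typescript
// Retry 5xx and network errors with exponential backoff; abort immediately
// on 4xx (auth failures, bad requests), since retrying cannot fix those.
class HttpError extends Error {
  constructor(public status: number) {
    super(`HTTP ${status}`);
  }
}

async function withRetries<T>(
  fn: () => Promise<T>,
  { retries = 3, baseDelayMs = 100 } = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Network errors (not HttpError) and 5xx responses are retryable.
      const retryable = !(err instanceof HttpError) || err.status >= 500;
      if (!retryable || attempt >= retries) throw err; // 4xx aborts at once
      const delay = baseDelayMs * 2 ** attempt; // exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A thrown `HttpError(401)` surfaces on the first attempt, while a flaky `HttpError(503)` is retried with growing delays until it succeeds or the retry budget is spent.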
Minimal runtime dependencies
The only runtime dependency is p-retry. The SDK uses raw fetch and Uint8Array — no heavy audio libraries, no provider SDK wrappers. It works anywhere fetch works.
Works seamlessly with Speech Gateway
Speech Gateway adds production infrastructure — queuing, quality processing, voice management, and analytics. One config change to connect. Coming Soon.
| Provider | Model String | Default |
| --- | --- | --- |
| OpenAI | openai/gpt-4o-mini-tts | Yes |
| OpenAI | openai/tts-1 | — |
| OpenAI | openai/tts-1-hd | — |
| ElevenLabs | elevenlabs/eleven_multilingual_v2 | Yes |
| ElevenLabs | elevenlabs/eleven_v3 | — |
| ElevenLabs | elevenlabs/eleven_flash_v2_5 | — |
| ElevenLabs | elevenlabs/eleven_flash_v2 | — |
| Deepgram | deepgram/aura-2 | Yes |
| Cartesia | cartesia/sonic-3 | Yes |
| Hume | hume/octave-2 | Yes |
| Google | google/gemini-2.5-flash-preview-tts | Yes |
| Google | google/gemini-2.5-pro-preview-tts | — |
| Fish Audio | fish-audio/s2-pro | Yes |
| Unreal Speech | unreal-speech/default | Yes |
| Murf | murf/GEN2 | Yes |
| Resemble | resemble/default | Yes |
| fal | fal-ai/ | — |
| Mistral | mistral/voxtral-mini-tts-2603 | Yes |
- Pass just the provider name to use its default model — e.g. model: 'openai' resolves to openai/gpt-4o-mini-tts.
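The resolution rule above can be sketched as a small lookup. The `resolveModel` function and the `DEFAULT_MODELS` map are illustrative (not the SDK's internals), and only a few of the defaults from the table are shown:

```typescript
// "provider/model" selects a specific model; a bare provider name resolves
// to that provider's default model from the table above.
const DEFAULT_MODELS: Record<string, string> = {
  openai: "openai/gpt-4o-mini-tts",
  elevenlabs: "elevenlabs/eleven_multilingual_v2",
  deepgram: "deepgram/aura-2",
  cartesia: "cartesia/sonic-3",
};

function resolveModel(model: string): { provider: string; model: string } {
  const slash = model.indexOf("/");
  if (slash === -1) {
    const full = DEFAULT_MODELS[model];
    if (!full) throw new Error(`Unknown provider: ${model}`);
    return resolveModel(full); // expand bare provider name to its default
  }
  return { provider: model.slice(0, slash), model };
}

console.log(resolveModel("openai")); // provider "openai", model "openai/gpt-4o-mini-tts"
```

Switching providers is then a one-string change: `resolveModel("cartesia")` routes the same call to Cartesia's default model instead.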
Frequently asked questions
Each provider has its own SDK, request format, auth pattern, and response shape. SpeechSDK gives you one interface for all of them — same function call, same result type, same error handling. Switch providers by simply changing a model string.
One SDK, every provider. Add text-to-speech to your app in minutes with a unified, open-source interface.