Show HN: TurboQuant-WASM – Google's vector quantization in the browser
Experimental WASM + relaxed SIMD build of botirk38/turboquant for browsers and Node.js.
Based on the paper "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (Google Research, ICLR 2026).
Live Demo — vector search, image similarity, and 3D Gaussian Splatting compression running in the browser.
What this adds
- npm package with embedded WASM — npm install turboquant-wasm
- Relaxed SIMD — @mulAdd FMA maps to f32x4.relaxed_madd
- SIMD-vectorized QJL sign packing/unpacking and scaling
- TypeScript API — TurboQuant.init() / encode() / decode() / dot()
- Golden-value tests — byte-identical output with the reference Zig implementation
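For intuition, the QJL sign packing mentioned above can be sketched in scalar TypeScript. This is an illustrative assumption about the bit layout (one sign bit per dimension, LSB-first within each byte), not the actual codec — the library does this with relaxed SIMD inside the WASM module:

```typescript
// Hypothetical scalar sketch of sign packing: each float's sign becomes one
// bit, so 8 dimensions fit in a byte. Names and layout are illustrative.
function packSigns(v: Float32Array): Uint8Array {
  const out = new Uint8Array(Math.ceil(v.length / 8));
  for (let i = 0; i < v.length; i++) {
    if (v[i] < 0) out[i >> 3] |= 1 << (i & 7); // set bit for negative values
  }
  return out;
}

// Recover +1/-1 signs from the packed bits.
function unpackSigns(packed: Uint8Array, dim: number): Float32Array {
  const out = new Float32Array(dim);
  for (let i = 0; i < dim; i++) {
    out[i] = (packed[i >> 3] >> (i & 7)) & 1 ? -1 : 1;
  }
  return out;
}
```

The WASM build vectorizes both directions, processing multiple lanes per instruction instead of one sign at a time.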
Browser Requirements
The WASM binary uses relaxed SIMD instructions:
| Runtime | Minimum version |
| --- | --- |
| Chrome | 114+ |
| Firefox | 128+ |
| Safari | 18+ |
| Node.js | 20+ |
Quick Start
```ts
import { TurboQuant } from "turboquant-wasm";

const tq = await TurboQuant.init({ dim: 1024, seed: 42 });

// Compress a vector (~4.5 bits/dim, ~6x compression)
const compressed = tq.encode(myFloat32Array);

// Decode back
const decoded = tq.decode(compressed);

// Fast dot product without decoding
const score = tq.dot(queryVector, compressed);

tq.destroy();
```
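For intuition on how a dot product can be scored against compressed data without decoding, here is a simplified 1-bit analogue. This is a sketch only — it assumes one sign bit per dimension, LSB-first within each byte; TurboQuant's real estimator also accounts for quantized magnitudes and scaling:

```typescript
// Simplified analogue of dot(): accumulate query values with signs read
// straight from the packed bits, never materializing a decoded vector.
function dotPacked(query: Float32Array, packed: Uint8Array): number {
  let acc = 0;
  for (let i = 0; i < query.length; i++) {
    const sign = (packed[i >> 3] >> (i & 7)) & 1 ? -1 : 1;
    acc += sign * query[i];
  }
  return acc;
}
```

Skipping the decode step is what makes quantized scans cheap: the inner loop touches one packed byte per 8 dimensions instead of 8 floats.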
API
```ts
class TurboQuant {
  static async init(config: { dim: number; seed: number }): Promise<TurboQuant>;
  encode(vector: Float32Array): Uint8Array;
  decode(compressed: Uint8Array): Float32Array;
  dot(query: Float32Array, compressed: Uint8Array): number;
  destroy(): void;
}
```
Building
```sh
# Run tests
zig test -target aarch64-macos src/turboquant.zig

# Full npm build (zig -> wasm-opt -> base64 embed -> bun + tsc)
bun run build

# Build WASM only
bun run build:zig
```
Requires Zig 0.15.2 and Bun.
Quality
Encoding preserves inner products — verified by golden-value tests and distortion bounds:
- MSE decreases with dimension (unit vectors)
- Bits/dim is ~4.5 (payload only, excluding the 22-byte header)
- Dot-product preservation — mean absolute error < 1.0 for unit vectors at dim=128
- Bit-identical output with botirk38/turboquant for the same input + seed
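As a sanity check on the numbers above, the expected compressed size can be worked out directly. This assumes exactly 4.5 bits/dim of payload plus the 22-byte header; the actual on-disk layout may add alignment or metadata:

```typescript
// Back-of-envelope compression ratio: raw float32 storage versus
// ~4.5 bits/dim payload plus a fixed 22-byte header.
function compressionRatio(dim: number, bitsPerDim = 4.5, headerBytes = 22): number {
  const rawBytes = dim * 4; // float32 input
  const payloadBytes = Math.ceil((dim * bitsPerDim) / 8);
  return rawBytes / (payloadBytes + headerBytes);
}
```

At dim=1024 this gives roughly 598 bytes versus 4096, i.e. close to 7x on the payload alone, in line with the "~6x compression" figure once any additional framing is accounted for.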
Credits
- botirk38/turboquant — original Zig implementation
- TurboQuant paper (Google Research, ICLR 2026) — algorithm design
License
MIT