Google's TurboQuant saves memory, but won't save us from DRAM-pricing hell
Chocolate Factory’s compression tech clears the way to cheaper AI inference, not more affordable memory
When Google unveiled TurboQuant, an AI data compression technology that promises to slash the amount of memory required to serve models, many hoped it would help with a memory shortage that has seen prices triple since last year. Not so much.
TurboQuant isn't the savior you might be hoping for. Having said that, the underlying technology is still worth a closer look as it has major implications for model devs and inference providers.
What the heck is TurboQuant?
Detailed by Google researchers in a recent blog post, TurboQuant is essentially a method of compressing data used in generative AI from higher to lower precisions, an approach commonly referred to as quantization.
According to the researchers, TurboQuant has the potential to cut memory consumption during inference by at least 6x, a bold claim at a time when DRAM and NAND prices are at record highs.
However, unlike most quantization methods, TurboQuant doesn't shrink the model itself. Instead, it aims to reduce the amount of memory required to store the key-value (KV) caches used to maintain context during LLM inference.
In a nutshell, the KV cache is a bit like the model's short-term memory. During a chat session, for example, the KV cache is how the model keeps track of your conversation.
Where things get tricky is that these KV caches can pile up quite quickly, often consuming more memory than the model itself.
Usually, these KV caches are stored at 16-bit precision, so if you can shrink the number of bits used to store them to eight or even four bits, you can reduce the memory required by a factor of 2x to 4x.
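A quick back-of-the-envelope calculation shows how that scaling works. The layer, head, and dimension counts below are hypothetical, roughly the shape of a large open-weights model, and aren't taken from Google's post; the point is simply that cache size scales linearly with bits per value:

```python
# Back-of-the-envelope KV-cache sizing. The model shape below is hypothetical
# (roughly the dimensions of a large open-weights model), not from Google's post.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    # 2x covers keys and values; divide by 8 to convert bits to bytes
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value / 8

for bits in (16, 8, 4):
    gib = kv_cache_bytes(80, 8, 128, seq_len=128_000, bits_per_value=bits) / 2**30
    print(f"{bits:>2}-bit KV cache at 128K tokens: {gib:.1f} GiB")
```

At these illustrative dimensions, dropping from 16 to 4 bits shrinks a 128K-token cache from roughly 39 GiB to under 10 GiB.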
While TurboQuant has certainly brought attention to KV cache quantization, the overarching idea isn't new. In fact, it's quite common for inference engines to store KV caches at FP8 for these reasons.
However, this kind of quantization isn't free. Lower precision means fewer bits to store keys and values, and therefore less memory, but it also means lost information that can degrade the quality of the model's responses. On top of that, quantizing and dequantizing the cache tends to introduce its own performance overheads.
This is really where TurboQuant's innovations lie. Google claims that it can achieve quality similar to BF16 using just 3.5 bits, while also mitigating those pesky overheads. At 4 bits, they claim as much as an 8x speedup on H100s when computing attention logits used to decide what in the context is or isn't important to the request.
And the researchers didn't stop there. In testing, they found they could crush the KV caches to 2.5 bits with minimal quality loss, which is where the claimed 6x memory reduction appears to have come from.
How does it work?
TurboQuant is able to achieve this feat by combining two mathematical approaches: Quantized Johnson-Lindenstrauss (QJL) and PolarQuant.
PolarQuant works by mapping KV-cache vectors, which are just high-dimensional mathematical expressions of magnitude and direction, onto a circular grid that uses polar rather than Cartesian coordinates.
"This is comparable to replacing 'Go 3 blocks east, 4 blocks north' with 'go 5 blocks total at a 37-degree angle,'" Google's blog post explains.
Using this approach, the vector's magnitude and direction are represented by its radius and angle. The search giant says this eliminates the memory overhead associated with data normalization, as every vector now shares a common reference point.
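To make the idea concrete, here's a toy sketch of polar-style quantization on 2D sub-vectors. It illustrates the general principle only, not Google's actual PolarQuant algorithm, which is more sophisticated and operates on high-dimensional KV vectors:

```python
import math

# Toy sketch of polar-style quantization on a 2D vector. Illustrative only;
# not Google's actual PolarQuant construction.
def quantize_polar(x, y, angle_bits=4):
    r = math.hypot(x, y)          # magnitude ("5 blocks total")
    theta = math.atan2(y, x)      # direction (the angle)
    levels = 2 ** angle_bits
    # Snap the angle to one of 2**angle_bits evenly spaced directions.
    code = round((theta + math.pi) / (2 * math.pi) * levels) % levels
    return r, code

def dequantize_polar(r, code, angle_bits=4):
    levels = 2 ** angle_bits
    theta = code * 2 * math.pi / levels - math.pi
    return r * math.cos(theta), r * math.sin(theta)

# "Go 3 blocks east, 4 blocks north" becomes radius 5 at ~53 degrees from east.
r, code = quantize_polar(3.0, 4.0)
x_hat, y_hat = dequantize_polar(r, code)
print(f"r={r:.1f}, reconstructed point=({x_hat:.2f}, {y_hat:.2f})")
```

Note that the magnitude survives exactly; only the direction is coarsened, and the error shrinks as you spend more bits on the angle.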
In addition to PolarQuant, Google also employs QJL to correct any errors introduced during the first phase and preserve the accuracy of the attention score used by the model to determine what information is or isn't important to serving a request.
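To get a feel for how heavily quantized projections can still preserve attention-relevant quantities, here's a rough, SimHash-flavored sketch of the sign-based estimation idea behind QJL-style schemes. The construction in Google's work differs in its details; everything below, including the vectors, is illustrative:

```python
import math, random

# Rough sketch of sign-based inner-product estimation (SimHash-flavored),
# in the spirit of QJL-style schemes. Illustrative only.
random.seed(0)

def random_projection(dim, m):
    # m random Gaussian directions in dim-dimensional space
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(m)]

def signs(vec, proj):
    # Keep only 1 bit per projected coordinate: which side of each hyperplane
    return [1 if sum(p * v for p, v in zip(row, vec)) >= 0 else -1 for row in proj]

def estimated_inner_product(q, k, proj):
    # For Gaussian projections, P(sign agreement) = 1 - angle/pi
    sq, sk = signs(q, proj), signs(k, proj)
    agree = sum(a == b for a, b in zip(sq, sk)) / len(proj)
    angle = math.pi * (1 - agree)
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return norm(q) * norm(k) * math.cos(angle)

q = [0.5, -1.2, 0.3, 0.9]
k = [0.4, -1.0, 0.1, 1.1]
proj = random_projection(dim=4, m=2048)
exact = sum(a * b for a, b in zip(q, k))
print(f"exact={exact:.3f}, estimated={estimated_inner_product(q, k, proj):.3f}")
```

Even with a single bit per projected coordinate, the estimate lands close to the true inner product, which is the quantity attention scores depend on.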
The result is that these vectors can be stored using a fraction of the memory. And the tech isn't limited to KV caches, either. According to Google, it also has implications for the vector databases used by search engines.
Why TurboQuant won't deliver us from memory mayhem
With a claimed compression ratio of 6:1, it's not surprising that many on Wall Street tied memory makers' sliding share prices to the introduction of TurboQuant.
But while the tech is likely to make AI inference clusters more efficient, and therefore less expensive to operate, it's unlikely to curb demand for the DRAM and NAND flash used to store those KV caches.
A year ago, open weights models like DeepSeek R1 offered context windows ranging from 64,000 to 256,000 tokens. Today, it's not uncommon to find open models sporting context windows exceeding one million tokens.
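A little arithmetic, using the same sort of hypothetical model shape as before, illustrates the point: context growth can outpace the compression gains.

```python
# Illustrative arithmetic with a hypothetical 80-layer model shape.
# Going from 64K tokens at 16 bits to 1M tokens at a claimed 2.5 bits
# still more than doubles the KV-cache footprint.
values_per_token = 2 * 80 * 8 * 128  # keys + values: layers * kv_heads * head_dim
gib = lambda tokens, bits: values_per_token * tokens * bits / 8 / 2**30
old = gib(64_000, 16)      # last year's context at full precision
new = gib(1_000_000, 2.5)  # today's context at TurboQuant's claimed precision
print(f"64K @ 16-bit: {old:.1f} GiB -> 1M @ 2.5-bit: {new:.1f} GiB")
```

In other words, a roughly 6x compression gain is swallowed whole by a 15x jump in context length.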
TurboQuant could allow an inference provider to make do with less memory, or let them serve up models with larger context windows. With code assistants and agentic frameworks like OpenClaw driving demand for larger context windows, the latter strikes us as the more likely of the two.
It seems the industry watchers at TrendForce would agree. In a report published earlier this week, they predicted that TurboQuant will spur long-context applications that increase demand for memory rather than curb it. ®
The Register AI/ML