
Google's TurboQuant saves memory, but won't save us from DRAM-pricing hell

The Register AI/ML · by Tobias Mann · April 1, 2026

Chocolate Factory's compression tech clears the way to cheaper AI inference, not more affordable memory

When Google unveiled TurboQuant, an AI data compression technology that promises to slash the amount of memory required to serve models, many hoped it would help with a memory shortage that has seen prices triple since last year. Not so much.

TurboQuant isn't the savior you might be hoping for. Having said that, the underlying technology is still worth a closer look as it has major implications for model devs and inference providers.

What the heck is TurboQuant

Detailed by Google researchers in a recent blog post, TurboQuant is essentially a method of compressing data used in generative AI from higher to lower precisions, an approach commonly referred to as quantization.

According to researchers, TurboQuant has the potential to cut memory consumption during inference by at least 6x, a bold claim at a time when DRAM and NAND prices are at record highs.

However, unlike most quantization methods, TurboQuant doesn't shrink the model. Instead it aims to reduce the amount of memory required to store the key value (KV) caches used to maintain context during LLM inference.

In a nutshell, the KV cache is a bit like the model's short-term memory. During a chat session, for example, the KV cache is how the model keeps track of your conversation.

Where things get tricky is that these KV caches can pile up quite quickly, often consuming more memory than the model itself.

Usually, these KV caches are stored at 16-bit precision, so if you can shrink the number of bits used to store them to eight or even four bits, you can reduce the memory required by a factor of 2x to 4x.
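As a back-of-the-envelope illustration (the model dimensions below are hypothetical, not from Google's post), the KV cache footprint is roughly two tensors (keys and values) times layers, KV heads, head dimension, sequence length, and bits per element:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits: int) -> int:
    """Rough KV cache size: 2x for the separate key and value tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits // 8

# Hypothetical Llama-style config: 32 layers, 8 KV heads, head dim 128
args = (32, 8, 128, 128_000)
full = kv_cache_bytes(*args, bits=16)        # 16-bit baseline
print(full / 2**30)                          # 15.625 GiB for a 128k-token context
print(full / kv_cache_bytes(*args, bits=8))  # 2.0 at 8 bits
print(full / kv_cache_bytes(*args, bits=4))  # 4.0 at 4 bits
```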

While TurboQuant has certainly brought attention to KV cache quantization, the overarching idea isn't new. In fact, it's quite common for inference engines to store KV caches at FP8 for these reasons.

However, this kind of quantization isn't free. Lower precision means fewer bits to store keys and values and therefore less memory, but it also means less accuracy, and the quantize/dequantize steps tend to introduce performance overheads of their own.

This is really where TurboQuant's innovations lie. Google claims that it can achieve quality similar to BF16 using just 3.5 bits, while also mitigating those pesky overheads. At 4 bits, they claim as much as an 8x speedup on H100s when computing attention logits used to decide what in the context is or isn't important to the request.

And the researchers didn't stop there. In testing, they found they could crush the KV caches to 2.5 bits with minimal quality loss, which appears to be where the claimed 6x memory reduction comes from (16 bits / 2.5 bits ≈ 6.4x).

How does it work

TurboQuant is able to achieve this feat by combining two mathematical approaches: Quantized Johnson-Lindenstrauss (QJL) and PolarQuant.

PolarQuant works by mapping KV-cache vectors, which are just high-dimensional mathematical expressions of magnitude and direction, onto a circular grid that uses polar rather than Cartesian coordinates.

"This is comparable to replacing 'Go 3 blocks east, 4 blocks north' with 'go 5 blocks total at a 37-degree angle,'" Google's blog post explains.

Using this approach, the vector's magnitude and direction are represented by its radius and angle, which the search giant says eliminates the memory overhead associated with data normalization, since every vector now shares a common reference point.
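A toy sketch of the idea in two dimensions may help. This is purely illustrative: the function names, bit widths, and clipping radius are assumptions, and Google's actual scheme operates on high-dimensional KV vectors, not 2D points.

```python
import math

def polar_quantize(x: float, y: float, r_bits: int = 4,
                   theta_bits: int = 4, r_max: float = 10.0):
    """Quantize a 2D vector by its polar coordinates (illustrative only)."""
    r = math.hypot(x, y)                    # magnitude
    theta = math.atan2(y, x)                # direction, in [-pi, pi]
    r_levels = (1 << r_bits) - 1
    t_levels = (1 << theta_bits) - 1
    r_q = round(min(r, r_max) / r_max * r_levels)
    t_q = round((theta + math.pi) / (2 * math.pi) * t_levels)
    return r_q, t_q

def polar_dequantize(r_q: int, t_q: int, r_bits: int = 4,
                     theta_bits: int = 4, r_max: float = 10.0):
    """Map the quantized (radius, angle) codes back to Cartesian coordinates."""
    r = r_q / ((1 << r_bits) - 1) * r_max
    theta = t_q / ((1 << theta_bits) - 1) * 2 * math.pi - math.pi
    return r * math.cos(theta), r * math.sin(theta)
```

Following the blog's example, the vector "3 blocks east, 4 blocks north" is stored as a quantized radius (5 blocks) and angle, and an approximate version of the original point can be recovered from just those two small codes.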

In addition to PolarQuant, Google also employs QJL to correct any errors introduced during the first phase and preserve the accuracy of the attention score used by the model to determine what information is or isn't important to serving a request.
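The Johnson-Lindenstrauss piece rests on a classic result: random projections approximately preserve inner products, which is exactly what attention scores are. Below is a plain, unquantized JL-style estimator demonstrating that property; it is a sketch of the underlying lemma, not the QJL variant Google describes.

```python
import random

def jl_inner_product(u: list, v: list, k: int = 4096, seed: int = 0) -> float:
    """Estimate <u, v> from k random Gaussian projections.

    E[(r.u)(r.v)] = <u, v> when r has i.i.d. N(0, 1) entries,
    so averaging over k random directions converges to the true value.
    """
    rng = random.Random(seed)
    est = 0.0
    for _ in range(k):
        row = [rng.gauss(0, 1) for _ in range(len(u))]
        pu = sum(r * a for r, a in zip(row, u))
        pv = sum(r * b for r, b in zip(row, v))
        est += pu * pv
    return est / k

# True inner product of these two vectors is 0.6
print(jl_inner_product([1.0, 0.0], [0.6, 0.8]))
```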

The result is that these vectors can be stored using a fraction of memory. And this tech isn't limited to KV caches either. According to Google, the technology also has implications for vector databases used by search engines.


Why TurboQuant won't deliver us from memory mayhem

With a claimed compression ratio of 6:1, it's not surprising that many on Wall Street tied memory makers' sliding share prices to the introduction of TurboQuant.

But while the tech is likely to make AI inference clusters more efficient and therefore less expensive to operate, it's unlikely to curb demand for the NAND flash and DRAM memory used to store those KV-caches.

A year ago, open weights models like DeepSeek R1 offered context windows ranging from 64,000 to 256,000 tokens. Today, it's not uncommon to find open models sporting context windows exceeding one million tokens.
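That jump matters because KV cache size grows linearly with context length. Using a hypothetical per-token cost of 128 KiB at 16-bit precision (an assumption for illustration, not a figure from the article):

```python
PER_TOKEN_KIB = 128  # hypothetical per-token KV cost at 16-bit precision

def kv_gib(ctx_tokens: int) -> float:
    """KV cache footprint in GiB for a given context length."""
    return PER_TOKEN_KIB * ctx_tokens / 2**20

for ctx in (64_000, 256_000, 1_000_000):
    # Grows from roughly 8 GiB at 64k tokens to over 120 GiB at 1M tokens
    print(f"{ctx:>9,} tokens -> {kv_gib(ctx):7.1f} GiB of KV cache")
```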

TurboQuant could allow an inference provider to make do with less memory, or let them serve up models with larger context windows. With code assistants and agentic frameworks like OpenClaw driving demand for larger context windows, the latter strikes us as the more likely of the two.

It seems the industry watchers at TrendForce would agree. In a report published earlier this week, they predicted that TurboQuant will spur long-context applications, driving demand for more memory rather than curbing it. ®
