Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x

Ars Technica AI | by Ryan Whitwam | March 25, 2026 | 2 min read

Even if you don’t know much about the inner workings of generative AI models, you probably know they need a lot of memory. Hence, it is currently almost impossible to buy a measly stick of RAM without getting fleeced. Google Research recently revealed TurboQuant, a compression algorithm that reduces the memory footprint of large language models (LLMs) while also boosting speed and maintaining accuracy.

TurboQuant is aimed at reducing the size of the key-value cache, which Google likens to a “digital cheat sheet” that stores important information so it doesn’t have to be recomputed. This cheat sheet is necessary because, as we say all the time, LLMs don’t actually know anything; they can do a good impression of knowing things through the use of vectors, which map the semantic meaning of tokenized text. When two vectors are similar, that means they have conceptual similarity.
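To make the idea of "similar vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare two vectors. The article does not say which similarity measure is involved, so the choice of cosine similarity is an assumption, and the three tiny vectors (`cat`, `kitten`, `invoice`) are invented for illustration; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings, invented for illustration only.
cat = [0.9, 0.8, 0.1, 0.0]
kitten = [0.85, 0.75, 0.2, 0.05]
invoice = [0.0, 0.1, 0.9, 0.8]

# Vectors pointing in nearly the same direction score close to 1.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

A score near 1 means the two vectors point in almost the same direction, which is the sense in which they are "conceptually similar."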

High-dimensional vectors, which can have hundreds or thousands of dimensions, may describe complex information like the pixels in an image or a large data set. They also occupy a lot of memory and inflate the size of the key-value cache, bottlenecking performance. To make models smaller and more efficient, developers employ quantization techniques to run them at lower precision. The drawback is that the outputs get worse: the quality of token estimation goes down. With TurboQuant, Google’s early results show an 8x performance increase and a 6x reduction in memory usage in some tests without a loss of quality.
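The trade-off between precision and memory can be seen in a few lines. The sketch below uses plain uniform quantization, which is a generic textbook technique and not TurboQuant itself; the function names, the 4-bit setting, and the sample values are all assumptions chosen for illustration.

```python
def quantize(values, bits):
    """Map floats onto 2**bits evenly spaced integer levels (uniform quantization)."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]

# Hypothetical cache entries, invented for illustration only.
vector = [0.12, -0.57, 0.98, 0.33, -0.91, 0.05, 0.44, -0.20]

codes, lo, scale = quantize(vector, bits=4)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

# Storage ratio for the codes alone, ignoring the small per-vector
# overhead of storing lo and scale.
ratio_vs_fp16 = 16 / 4

print(codes)
print(f"max round-trip error: {max_error:.4f}")
print(f"16-bit to 4-bit ratio: {ratio_vs_fp16:.0f}x")
```

Going from 16 bits to 4 bits per value shrinks storage by 4x, but every restored value is slightly off. That rounding error is the quality loss that more careful schemes try to avoid.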

Angles and errors

Applying TurboQuant to an AI model is a two-step process. To achieve high-quality compression, Google has devised a system called PolarQuant. Usually, vectors in AI models are encoded using standard Cartesian (XYZ) coordinates, but PolarQuant converts them into polar coordinates. On this circular grid, each vector is reduced to two pieces of information: a radius (core data strength) and a direction (the data’s meaning).
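The coordinate change described above can be illustrated in two dimensions. This is only a sketch of the general Cartesian-to-polar conversion, not PolarQuant's actual procedure, which the article does not detail; the helper names `to_polar` and `to_cartesian` are hypothetical, and real model vectors would need more than one angle.

```python
import math

def to_polar(x, y):
    """Convert a 2D Cartesian point to (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)

def to_cartesian(radius, angle):
    """Convert (radius, angle) back to Cartesian (x, y)."""
    return radius * math.cos(angle), radius * math.sin(angle)

radius, angle = to_polar(3.0, 4.0)
print(radius)                         # 5.0
print(round(math.degrees(angle), 2))  # 53.13

# Converting back recovers the original point up to floating-point error.
x, y = to_cartesian(radius, angle)
```

The radius carries the vector's magnitude and the angle carries its direction, matching the two pieces of information the article describes.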
