GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

arXiv · March 26, 2026

Authors: Selim An, Il hong Suh, Yeseong Kim

Abstract: Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used to deploy large language models, but they often degrade accuracy at low bit widths (e.g., 4 bits). Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across the group's modules, reducing parameter and memory overhead while retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective model GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average.
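
The group-sharing idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The toy round-to-nearest quantizer (fake_quantize), the helper names, the matrix sizes, and the use of a truncated SVD over the stacked quantization errors are all assumptions made for illustration, since the abstract does not say how the shared right factor is obtained. The sketch treats three projections that read the same input (as the query, key, and value projections of an attention block do) as one hypothetical input-sharing group, computes the shared projection once, and reuses it for every module in the group.

```python
import numpy as np


def fake_quantize(w, bits=4):
    """Toy symmetric round-to-nearest quantizer, one scale per output row.

    Hypothetical stand-in for a real quantizer; illustration only.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def shared_low_rank(errors, rank):
    """Factor stacked errors [E_1; ...; E_k] ~= [A_1; ...; A_k] @ B with one B.

    Assumption: a plain truncated SVD of the stacked error matrix.
    """
    stacked = np.concatenate(errors, axis=0)            # (sum of d_out, d_in)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    left = u[:, :rank] * s[:rank]                       # (sum of d_out, rank)
    shared_right = vt[:rank]                            # (rank, d_in)
    splits = np.cumsum([e.shape[0] for e in errors])[:-1]
    return np.split(left, splits, axis=0), shared_right


def group_forward(x, quantized, lefts, shared_right):
    """Run every module in the group, computing the shared projection once."""
    z = x @ shared_right.T                              # cached, shape (batch, rank)
    return [x @ wq.T + z @ a.T for wq, a in zip(quantized, lefts)]


rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 48, 8
weights = [rng.standard_normal((d_out, d_in)) for _ in range(3)]
quantized = [fake_quantize(w) for w in weights]
errors = [w - wq for w, wq in zip(weights, quantized)]
lefts, shared_right = shared_low_rank(errors, rank)

x = rng.standard_normal((4, d_in))
exact = [x @ w.T for w in weights]
plain = [x @ wq.T for wq in quantized]
corrected = group_forward(x, quantized, lefts, shared_right)

# Squared output error summed over the group, without and with the correction.
err_plain = sum(np.linalg.norm(e - p) ** 2 for e, p in zip(exact, plain))
err_corrected = sum(np.linalg.norm(e - c) ** 2 for e, c in zip(exact, corrected))
print(err_plain, err_corrected)
```

Under these assumptions, a group of k modules with weights of shape d_out × d_in stores r(k·d_out + d_in) correction values for rank r, compared with k·r(d_out + d_in) when each module keeps its own right factor, and the rank-r projection of the input is computed once per group instead of k times.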

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.25385 [cs.LG]

(or arXiv:2603.25385v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.25385

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Selim An

[v1] Thu, 26 Mar 2026 12:36:44 UTC (7,559 KB)

Original source

arXiv

https://arxiv.org/abs/2603.25385v1
