
ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

arXiv, March 31, 2026 (10 min read)

arXiv:2603.27914v1 (announce type: cross). Author: Edward J. Yoon

Abstract: We present ITQ3_S (Interleaved Ternary Quantization – Specialized), a novel 3-bit weight quantization format for LLMs integrating TurboQuant (TQ), a rotation-domain strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit methods suffer precision loss from heavy-tailed weight distributions and inter-channel outliers. ITQ3_S pre-rotates the weight space via FWHT before quantization, spreading outlier energy across the vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. We derive a rigorous dequantization procedure fusing a 256-point Inverse FWHT into the CUDA shared-memory loading stage, ensuring reconstruction error is bounded exclusively by the ternary quantization grid with no additional error from the transform inversion. For any weight vector $\mathbf{w} \in \mathbb{R}^{256}$, the reconstruction satisfies $\|\hat{\mathbf{w}} - \mathbf{w}\|_2 \leq \epsilon_q$, a bound strictly smaller than that of uniform 3-bit baselines that do not exploit rotation-induced distribution normalization. TurboQuant lacks a native CUDA kernel, precluding direct deployment; naively composing TQ with existing weight quantizers introduces domain mismatch errors that accumulate across layers, degrading quality below standard 3-bit baselines. ITQ3_S resolves this by co-designing the FWHT rotation and quantization kernel as a unified pipeline grounded in the IQ3_S weight format, with the inverse transform fused into the CUDA MMQ kernel. Empirically, on the NVIDIA RTX 5090 (Blackwell), ITQ3_S achieves perplexity competitive with FP16 while delivering throughput exceeding 1.5x that of 4-bit alternatives via optimized DP4A and Tensor Core scheduling. Our results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer hardware.
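The rotate-then-quantize idea in the abstract can be illustrated with a small pure-Python sketch. This is not the paper's ITQ3_S format or its fused CUDA kernel: the function names, the mean-absolute-value scale, and the single-outlier test vector are all assumptions made for illustration, and the IQ3_S bit packing is not modeled. It shows two properties the abstract relies on: an orthonormal FWHT spreads one large outlier across all 256 entries, and because the transform preserves the L2 norm (and is its own inverse), the reconstruction error in the weight domain equals the quantization error in the rotated domain.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly stage: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)  # makes the transform orthonormal and self-inverse
    return [x * norm for x in out]


def ternary_quantize(vec, scale):
    """Round each value to the nearest multiple of scale, clamped to {-1, 0, +1}."""
    return [max(-1, min(1, round(x / scale))) for x in vec]


def l2(vec):
    return math.sqrt(sum(x * x for x in vec))


def demo(seed=0, n=256):
    rng = random.Random(seed)
    # Hypothetical weight block: small Gaussian values plus one large outlier.
    w = [rng.gauss(0.0, 0.02) for _ in range(n)]
    w[7] = 1.0

    rotated = fwht(w)
    # Assumed scale choice (not from the paper): mean absolute rotated value.
    scale = sum(abs(x) for x in rotated) / n
    codes = ternary_quantize(rotated, scale)

    dequant_rotated = [c * scale for c in codes]
    w_hat = fwht(dequant_rotated)  # inverse transform == forward transform

    return {
        "max_abs_weights": max(abs(x) for x in w),
        "max_abs_rotated": max(abs(x) for x in rotated),
        "err_rotated": l2([a - b for a, b in zip(dequant_rotated, rotated)]),
        "err_weights": l2([a - b for a, b in zip(w_hat, w)]),
    }


if __name__ == "__main__":
    for key, value in demo().items():
        print(f"{key}: {value:.6f}")
```

Running the script prints the peak magnitude before and after rotation and the two error norms. The two error values agree up to floating-point rounding, which is the sense in which the inverse transform adds no error of its own.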

Comments: 12 pages, 4 figures, 3 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

MSC classes: 68T50, 65T50, 94A29

ACM classes: I.2.7; C.1.4; B.3.2

Cite as: arXiv:2603.27914 [cs.LG]

(or arXiv:2603.27914v2 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.27914

arXiv-issued DOI via DataCite

Submission history

From: Edward J. Yoon
[v1] Mon, 30 Mar 2026 00:03:22 UTC (12 KB)
[v2] Tue, 31 Mar 2026 03:02:45 UTC (12 KB)
