IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

arXiv · March 31, 2026

Zhongping Ji

Abstract: Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise 3D Clifford rotors, yet the resulting 3D partition is poorly aligned with modern hardware and offers limited local mixing. We propose IsoQuant, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each 4D block as a quaternion and applies a closed-form transform $T(v) = q_L v \overline{q_R}$. This yields two main variants: IsoQuant-Full, which realizes the full $SO(4)$ rotation, and IsoQuant-Fast, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight 2D special case. At $d = 128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across 18 fused CUDA settings with $d \in \{128, 256, 512\}$, bit widths $\{2, 3, 4\}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$ to $4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize-dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.
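The blockwise transform $T(v) = q_L v \overline{q_R}$ described in the abstract can be illustrated with a short pure-Python sketch. This is not the paper's fused CUDA kernel: the helper names (`qmul`, `rotate_blocks`, `fma_count`), the choice of the left factor for the Fast-style variant, and the convention of counting one Hamilton product as 16 multiply-adds are all assumptions made for illustration. Under that counting convention, the sketch reproduces the 1,024 and 512 FMA figures quoted for $d = 128$.

```python
import math
import random

def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    """Quaternion conjugate; equals the inverse for unit quaternions."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def random_unit_quaternion(rng):
    """Draw a unit quaternion by normalizing a 4D Gaussian sample."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def rotate_blocks(x, q_left, q_right=None):
    """Apply T(v) = q_L v conj(q_R) to each consecutive 4D block of x.

    With q_right=None only the left factor is applied (assumed here to
    correspond to the single-factor, Fast-style variant).
    """
    assert len(x) % 4 == 0
    out = []
    for b in range(len(x) // 4):
        v = tuple(x[4 * b:4 * b + 4])
        v = qmul(q_left[b], v)
        if q_right is not None:
            v = qmul(v, qconj(q_right[b]))
        out.extend(v)
    return out

def inverse_rotate_blocks(y, q_left, q_right=None):
    """Undo rotate_blocks: v = conj(q_L) y q_R for unit quaternions."""
    out = []
    for b in range(len(y) // 4):
        v = tuple(y[4 * b:4 * b + 4])
        if q_right is not None:
            v = qmul(v, q_right[b])
        v = qmul(qconj(q_left[b]), v)
        out.extend(v)
    return out

def fma_count(d, full=True):
    """Multiply-add count, assuming 16 per Hamilton product per 4D block."""
    products_per_block = 2 if full else 1
    return (d // 4) * 16 * products_per_block

rng = random.Random(0)
d = 128
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
q_left = [random_unit_quaternion(rng) for _ in range(d // 4)]
q_right = [random_unit_quaternion(rng) for _ in range(d // 4)]

y_full = rotate_blocks(x, q_left, q_right)   # two-sided (Full-style)
y_fast = rotate_blocks(x, q_left)            # one-sided (Fast-style)
x_back = inverse_rotate_blocks(y_full, q_left, q_right)
```

Because both factors are unit quaternions, each block map is orthogonal, so the rotated vector keeps the same Euclidean norm and the inverse recovers the input up to floating-point error. In a quantization pipeline, a quantizer would sit between the forward and inverse rotation; that step is omitted here.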

Comments: 11 pages

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Cite as: arXiv:2603.28430 [cs.LG] (or arXiv:2603.28430v1 [cs.LG] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.28430 (arXiv-issued DOI via DataCite)

Submission history

From: Zhongping Ji
[v1] Mon, 30 Mar 2026 13:37:45 UTC (14 KB)

Original source: arXiv, https://arxiv.org/abs/2603.28430
