
MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

arXiv · March 31, 2026

arXiv:2508.02343v2 (announce type: replace-cross)

Authors: Wenyuan Liu, Haoqian Meng, Yilun Luo, Yafei Zhao, Peng Zhang, Xindian Ma


Abstract: Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and GEMM kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. On the Llama and Qwen model families, MicroMix achieves near-FP16 performance across diverse downstream tasks with an average precision of 5 bits. In particular, Qwen2.5-32B-Base, Coder and Math exhibit lossless accuracy on zero-shot, code generation, and mathematical reasoning benchmarks. In addition, on RTX 5070Ti laptop and RTX 5090 GPUs, our kernel achieves 2.29-3.38x acceleration compared to TensorRT-FP16. Our code is available at this https URL.
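To make the MX terminology in the abstract concrete, here is a minimal NumPy sketch of MXFP4-style "fake" quantization (quantize, then dequantize) and of the average bit-width arithmetic. It is an illustration under stated assumptions, not the MicroMix kernel or the authors' algorithm: the block size of 32, the power-of-two shared scale, and the E2M1 value grid follow the OCP Microscaling specification, while the function names, the nearest-value rounding, and the example channel split are hypothetical choices made for this sketch.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (OCP Microscaling spec).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_EMAX = 2      # largest E2M1 exponent: max value is 1.5 * 2**2 = 6
BLOCK_SIZE = 32   # MX formats share one power-of-two scale per 32 elements


def mxfp4_fake_quantize(x):
    """Quantize then dequantize x (length must be a multiple of 32).

    Hypothetical helper for illustration only. Ties are rounded toward the
    smaller magnitude here; real hardware rounding may differ.
    """
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK_SIZE)
    amax = np.abs(blocks).max(axis=1, keepdims=True)

    # Shared scale 2**(floor(log2(amax)) - emax); all-zero blocks use scale 1.
    safe_amax = np.where(amax > 0, amax, 1.0)
    scale = np.where(
        amax > 0, 2.0 ** (np.floor(np.log2(safe_amax)) - FP4_EMAX), 1.0
    )

    # Scale into the element range and saturate at the largest grid value.
    scaled = np.clip(np.abs(blocks) / scale, 0.0, FP4_E2M1_GRID[-1])

    # Snap every magnitude to the nearest representable E2M1 value.
    idx = np.abs(scaled[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    dequantized = np.sign(blocks) * FP4_E2M1_GRID[idx] * scale
    return dequantized.reshape(np.shape(x))


def average_bits(n_fp4, n_fp6, n_fp8):
    """Average element bit-width for a channel mix (shared scales ignored)."""
    total = n_fp4 + n_fp6 + n_fp8
    return (4 * n_fp4 + 6 * n_fp6 + 8 * n_fp8) / total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    activations = rng.normal(size=128)
    error = np.abs(activations - mxfp4_fake_quantize(activations)).mean()
    print("mean absolute MXFP4 error:", error)
    # One of many channel splits that averages 5 bits (an assumed example).
    print("average bits:", average_bits(1, 1, 0))
```

The per-element error computed this way is the kind of quantity a threshold could be compared against when deciding which channels need MXFP6 or MXFP8, though the actual thresholding rule is defined in the paper and is not reproduced here.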

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2508.02343 [cs.LG] (or arXiv:2508.02343v2 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2508.02343

arXiv-issued DOI via DataCite

Journal reference: ICLR 2026

Submission history

From: Wenyuan Liu

[v1] Mon, 4 Aug 2025 12:22:39 UTC (3,629 KB)

[v2] Mon, 30 Mar 2026 09:16:28 UTC (6,716 KB)

Original source: arXiv, https://arxiv.org/abs/2508.02343
