APEX MoE quantized models: 33% faster inference, plus TurboQuant (14% speedup in prompt processing)
I've just released APEX (Adaptive Precision for EXpert Models): a novel MoE quantization technique that outperforms Unsloth Dynamic 2.0 on accuracy while being 2x smaller for MoE architectures. Benchmarked on Qwen3.5-35B-A3B, but the method applies to any MoE model. Half the size of Q8. Perplexity comparable to F16. Works with stock llama.cpp with no patches. Open source (of course!), with the github.com/mudler/LocalAI team!

[Image: benchmark chart — https://preview.redd.it/uv2bnfheymsg1.jpg?width=1632&format=pjpg&auto=webp&s=3eca979e8f9ca6b75d206eecdf29308b74aed530]

Perplexity by itself doesn't tell the full story. KL divergence tells a story perplexity doesn't:

[Image: KL divergence chart — https://preview.redd.it/jn9ua2ksymsg1.jpg?width=1617&format=pjpg&auto=webp&s=7df969308e10aa6b6d31098c92fca1c14bb42a40]

Tiers for every GPU:
- I-Quality: 21.3 GB
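To make the perplexity-vs-KL point concrete: two models can land at nearly the same perplexity while their full next-token distributions drift apart, and KL divergence against the full-precision reference measures exactly that drift. Here is a minimal sketch (not APEX's evaluation harness; the shapes and toy data are illustrative assumptions) of the computation:

```python
# Minimal sketch, not APEX's evaluation code: given next-token logits from
# an F16 reference model and a quantized model over the same token positions,
# compute the mean KL(P_ref || P_quant).
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL divergence per token position, reference || quantized."""
    p = softmax(ref_logits)    # F16 reference distribution
    q = softmax(quant_logits)  # quantized-model distribution
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean())

# Toy data standing in for real model outputs: 4 positions, 32k-token vocab.
rng = np.random.default_rng(0)
ref = rng.normal(size=(4, 32000))
quant = ref + rng.normal(scale=0.05, size=ref.shape)  # simulated quantization noise
print(f"mean KL: {mean_kl(ref, quant):.6f}")
```

In practice the logits come from running the F16 and quantized GGUFs over the same evaluation text; llama.cpp's llama-perplexity tool ships a --kl-divergence mode (with --kl-divergence-base to save the reference logits) that performs this comparison natively.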
Could not retrieve the full article text. Read the full post on Reddit (r/LocalLLaMA): https://www.reddit.com/r/LocalLLaMA/comments/1s9vzry/apex_moe_quantized_models_boost_with_33_faster/
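The post doesn't detail how APEX picks precisions, but the name "Adaptive Precision for EXpert Models" points at per-expert bit-width selection. Purely as a hypothetical illustration (the function, scores, and budget below are invented, not APEX's algorithm): score each expert's sensitivity to quantization, then greedily grant more bits to the most sensitive experts under an average-bit budget.

```python
# Hypothetical illustration only: the post does not describe APEX's actual
# algorithm. This sketches the general idea behind per-expert adaptive
# precision: sensitive experts get more bits, robust ones fewer, subject
# to an average bit-width budget.
import numpy as np

def assign_expert_bits(sensitivity: np.ndarray,
                       bit_options=(4, 6, 8),
                       budget_bits_per_weight: float = 5.0) -> list[int]:
    """Greedy bit assignment: start every expert at the lowest precision,
    then upgrade experts from most to least sensitive while the average
    bit-width stays within budget."""
    bits = [bit_options[0]] * len(sensitivity)
    for idx in np.argsort(-sensitivity):       # most sensitive first
        for b in bit_options[1:]:              # try progressively higher precision
            trial = bits.copy()
            trial[idx] = b
            if np.mean(trial) <= budget_bits_per_weight:
                bits = trial                   # keep the upgrade if it fits
    return bits

# Example: 8 experts with made-up sensitivity scores.
sens = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.05, 0.6, 0.3])
print(assign_expert_bits(sens))  # e.g. [8, 4, 4, 8, 4, 4, 4, 4]
```

A real scheme would derive the sensitivity scores from calibration data, e.g. how often the router activates each expert and how much error quantizing it induces in the model's outputs, rather than from hand-picked numbers.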
