Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models - WSJ

Google News: LLMMarch 31, 20261 min read0 views

Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models WSJ

Could not retrieve the full article text.

Original source

Google News: LLM

https://news.google.com/rss/articles/CBMiuANBVV95cUxPMlBMQk5fYWJpWFI3YzRRQW03ZVFVVkV3Rk1YSXV6TXFaSUtEM3hGQXdSaGxoSThrdWpOREZtOFVJaTBYQ3dWS0x0cnoyMUZDS2xMQmZSUEc3N0xlQXEyUW0zOFU1bFBfMlJWdW4zOHpiOGlZUUtEUjI1cDYtR21GV04tNzVPZTEtZ3dwRHNGR3lsLW1TRHp3N3h5UEdYUEI2S2tGM0lxSVZXZnB3eERsLVVVTEU4czJkNlQxUXVWVmMta2FxSTZMdTktcXhLUExKbUJWSlZ4ZGdVRFRRZDgwX3cwbGt5ajFvU0gxdUFXOXFQQXcyb2ZKazJVa3hBeTQ0T1JWTnA3Q2ZoTGRtcVJ5eVFaUkktSUpYS1h6YTF2R3Q4bjB3eWdGbF8zeXBBMHFfdmticno3c25hbklhY2VabVlCX25ldGZ6RThmY2ZXeWFjMGhpNHJ2dHhfSTByampBbEpVUjU0MjVhS05uRXctWjI2UWVQc0M3Qk5DOHp6UDNPWFV4OUpwSFp3bXdyZ1RGVmtHUEVvTkxfMkMyWlRGSzBsNG5yUzg5cHZEeEZpUTM0Um5UajdLVg?oc=5

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

modelresearch

ModelsFresh

Bench 2xMI50 Qwen3.5 27b vs Gemma4 31B (vllm-gfx906-mobydick)

Inference engine used (vllm fork) : https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main Huggingface Quants used: QuantTrio/Qwen3.5-27B-AWQ vs cyankiwi/gemma-4-31B-it-AWQ-4bit Relevant commands to run : docker run -it --name vllm-gfx906-mobydick -v ~/llm/models:/models --network host --device=/dev/kfd --device=/dev/dri --group-add video --group-add $(getent group render | cut -d: -f3) --ipc=host aiinfos/vllm-gfx906-mobydick:latest FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm serve \ /models/gemma-4-31B-it-AWQ-4bit \ --served-model-name gemma-4-31B-it-AWQ-4bit \ --dtype float16 \ --max-model-len auto \ --gpu-memory-utilization 0.95 \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --mm-processor-cache-gb 1

Reddit r/LocalLLaMA

3mabout 4 hours ago

Open Source AILive

Vllm gemma4 26b a4b it-nvfp4 run success

#!/usr/bin/env bash set -euo pipefail BASE_DIR="/mnt/d/AI/docker-gemma4" PATCH_DIR="$BASE_DIR/nvfp4_patch" BUILD_DIR="$BASE_DIR/build" HF_CACHE_DIR="$BASE_DIR/hf-cache" LOG_DIR="$BASE_DIR/logs" PATCH_FILE="$PATCH_DIR/gemma4_patched.py" DOCKERFILE_PATH="$BUILD_DIR/Dockerfile" BASE_IMAGE="vllm/vllm-openai:gemma4" PATCHED_IMAGE="vllm-gemma4-nvfp4-patched" CONTAINER_NAME="vllm-gemma4-nvfp4" MODEL_ID="bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4" SERVED_MODEL_NAME="gemma-4-26b-a4b-it-nvfp4" GPU_MEMORY_UTILIZATION="0.88" MAX_MODEL_LEN="512" MAX_NUM_SEQS="1" PORT=" " PATCH_URL=" https://huggingface.co/bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4/resolve/main/gemma4_patched.py?download=true " if [[ -z "${HF_TOKEN:-}" ]]; then echo "[ERROR] HF_TOKEN environment variable is empty." echo "Please run th

Reddit r/LocalLLaMA

2mabout 1 hour ago

Open Source AIFresh

TurboQuant on Apple Silicon: real benchmarks on Mac Mini M4 16GB and M3 Max 48GB

I’ve been testing TurboQuant this week on two machines and wanted to share the actual numbers. Why this matters: TurboQuant compresses the KV cache, not the model weights. On long contexts, KV cache can take several GB of memory, so reducing it can make a big difference even when throughput stays similar. In the setup I tested, K stays at q8_0 and V goes to turbo3 (~3-bit). That asymmetric tradeoff makes sense because errors in the keys affect attention routing more directly, while values often tolerate heavier compression better. Benchmark 1: Mac Mini M4 16GB — Qwen3-14B Q4_K_M at 8K context → Without TurboQuant: KV cache 1280 MiB, K (f16): 640 MiB, V (f16): 640 MiB — 9.95 t/s → With TurboQuant: KV cache 465 MiB, K (q8_0): 340 MiB, V (turbo3): 125 MiB — 9.25 t/s Almost 3x compression, wit

Reddit r/LocalLLaMA

2mabout 3 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 166 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Models

Models

As if Instagram wasn't enough: Russia proposes ban on foreign AI tools including ChatGPT and Claude - Cybernews

As if Instagram wasn't enough: Russia proposes ban on foreign AI tools including ChatGPT and Claude Cybernews

GNews AI Russia

1m17 days ago

Models

Vietnam IP law change fuels debate, uncertainty over AI training data - MLex

Vietnam IP law change fuels debate, uncertainty over AI training data MLex

Google News - AI Vietnam

1m4 months ago

ModelsFresh

Bench 2xMI50 Qwen3.5 27b vs Gemma4 31B (vllm-gfx906-mobydick)

Reddit r/LocalLLaMA

3mabout 4 hours ago

ModelsFresh

RTX 5090 gemma4-26b TG performance report

Nothing exhaustive... but I thought I'd report what I've seen from early testing. I'm running a modified version of vLLM that has NVFP4 support for gemma4-26b. Weights come in around 15.76 GiB and the remainder is KV cache. I'm running full context as well. For a "story telling" prompt and raw output with no thinking, I'm seeing about 150 t/s on TG. TTFT in streaming mode is about 80ms. Quality is good! submitted by /u/Nice_Cellist_7595 [link] [comments]

Reddit r/LocalLLaMA

1mabout 2 hours ago