b8638

llama.cpp Releases | by github-actions[bot] | April 2, 2026 | 1 min read

tests: allow exporting graph ops from HF file without downloading weights (#21182)

- tests: allow exporting graph ops from HF file without downloading weights
- use unique_ptr for llama_context in HF metadata case
- fix missing non-required tensors falling back to type f32
- use unique pointers where possible
- use no_alloc instead of fixing f32 fallback
- fix missing space

macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Intel (x64)
- iOS XCFramework

Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)

Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)

openEuler:
- openEuler x86 (310


Original source

llama.cpp Releases

https://github.com/ggml-org/llama.cpp/releases/tag/b8638



