v4.3.2
Changes
- Gemma 4 support with full tool-calling in the API and UI. 🆕
- ik_llama.cpp support: Add ik_llama.cpp as a new backend through new `textgen-portable-ik` portable builds and a new `--ik` flag for full installs. ik_llama.cpp is a fork by the author of the imatrix quants, including support for new quant types, significantly more accurate KV cache quantization (via Hadamard KV cache rotation, enabled by default), and optimizations for MoE models and CPU inference.
- API: Add `echo` + `logprobs` for `/v1/completions`. The completions endpoint now supports the `echo` and `logprobs` parameters, returning token-level log probabilities for both prompt and generated tokens. Token IDs are also included in the output via a new `top_logprobs_ids` field.
- Further optimize my custom Gradio fork, saving up to 50 ms per UI event (button click, etc.).
- Transformers: Autodetect `torch_dtype` from the model config instead of always forcing bfloat16/float16. The `--bf16` flag still works as an override.
- Remove the obsolete `models/config.yaml` file. Instruction templates are now detected from model metadata instead of filename patterns.
- Rename "truncation length" to "context length" in the terminal log message.
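The new completions parameters are easiest to see in a raw payload. The sketch below builds one by hand and notes how it would be sent; the server URL and port are assumptions, and the response-field comments follow the standard OpenAI completions schema rather than this project's exact code.

```python
import json

def build_completion_request(prompt, n_logprobs=5, echo=True):
    """Build an OpenAI-style /v1/completions payload that requests
    token-level log probabilities for both the prompt (echo=True)
    and the generated tokens."""
    return {
        "prompt": prompt,
        "max_tokens": 32,
        "echo": echo,           # include prompt tokens (and their logprobs) in the output
        "logprobs": n_logprobs, # return the top-N candidates per position
    }

payload = build_completion_request("The capital of France is")

# To send it against a locally running API server (the port here is an
# assumption), something like:
#
#   import urllib.request
#   req = urllib.request.Request(
#       "http://127.0.0.1:5000/v1/completions",
#       data=json.dumps(payload).encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   body = json.loads(urllib.request.urlopen(req).read())
#
# On OpenAI-style servers, choices[0]["logprobs"] typically carries parallel
# lists such as "tokens" and "token_logprobs"; this release adds the
# "top_logprobs_ids" field with the corresponding token IDs.
print(json.dumps(payload))
```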
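The dtype autodetection can be sketched in a few lines. This is an illustration of the idea only (the function name and fallback order are assumptions, not the project's actual implementation): read `torch_dtype` from the model's `config.json`, keep `--bf16` as an override, and fall back to float16 when the config is silent.

```python
def pick_dtype(config: dict, bf16_override: bool = False) -> str:
    """Choose a dtype from a model's config.json (sketch, not the
    project's actual code): honor an explicit --bf16 override first,
    then trust the dtype the model was saved with, and only fall back
    to float16 when the config does not say."""
    if bf16_override:              # --bf16 still wins over autodetection
        return "bfloat16"
    dtype = config.get("torch_dtype")
    if dtype in ("bfloat16", "float16", "float32"):
        return dtype
    return "float16"               # fallback when the config is silent

print(pick_dtype({"torch_dtype": "bfloat16"}))  # bfloat16
print(pick_dtype({}))                           # float16
```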
Security
- Gradio fork: Fix ACL bypass via case-insensitive path matching on Windows/macOS.
- Gradio fork: Add server-side validation for Dropdown, Radio, and CheckboxGroup.
- Fix SSRF in the superbooga extensions: URLs fetched by superbooga/superboogav2 are now validated to block requests to private/internal networks.
Bug fixes
- Fix `--idle-timeout` failing on encode/decode requests and not tracking parallel generation properly.
- Fix stopping string detection for chromadb/context-1 (`<|return|>` vs `<|result|>`).
- Fix Qwen3.5 MoE failing to load via ExLlamav3_HF.
- Fix `ban_eos_token` not working for ExLlamav3. EOS is now suppressed at the logit level.
- Fix the "Value: None is not in the list of choices: []" Gradio error introduced in v4.3. 🆕
- Fix llama.cpp crashing due to failing to parse the Gemma 4 template (even though we don't use llama.cpp's jinja parser). 🆕
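"Suppressed at the logit level" means the EOS token's logit is forced to negative infinity before sampling, so its post-softmax probability is exactly zero. A generic, self-contained sketch of the technique (not ExLlamav3's actual code; the token IDs are made up):

```python
import math

def ban_eos(logits: list[float], eos_token_id: int) -> list[float]:
    """Make the EOS token unsampleable by sending its logit to -inf.
    After softmax, its probability is exactly 0, so EOS can never be
    drawn, no matter how confident the model was."""
    out = list(logits)
    out[eos_token_id] = -math.inf
    return out

def softmax(xs):
    m = max(x for x in xs if x != -math.inf)
    exps = [math.exp(x - m) if x != -math.inf else 0.0 for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits = [1.0, 4.0, 2.0]   # pretend token 1 is EOS and heavily favored
probs = softmax(ban_eos(logits, eos_token_id=1))
print(probs[1])            # 0.0 — EOS can no longer be sampled
```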
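Tracking idleness correctly under parallel generation amounts to reference counting: the idle clock may only start once the last in-flight request (including encode/decode calls) finishes. A minimal sketch of that bookkeeping, with illustrative names rather than the project's actual code:

```python
import threading
import time

class IdleTracker:
    """Reference-count in-flight requests; the idle timestamp only
    advances when the count drops back to zero, so overlapping
    parallel generations never let the timeout fire early."""
    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self._last_idle = time.monotonic()

    def __enter__(self):               # wrap every request, including encode/decode
        with self._lock:
            self._active += 1
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._active -= 1
            if self._active == 0:
                self._last_idle = time.monotonic()
        return False

    def idle_for(self) -> float:
        """Seconds spent idle; 0 while any request is still running."""
        with self._lock:
            if self._active > 0:
                return 0.0
            return time.monotonic() - self._last_idle

tracker = IdleTracker()
with tracker:                          # simulate one request in flight
    print(tracker.idle_for())          # 0.0 — not idle while generating
```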
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@277ff5f:
  - Adds Gemma 4 support.
  - Adds improved KV cache quantization via activations rotation, based on TurboQuant (ggml-org/llama.cpp#21038).
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@d557d6c.
- Update ExLlamaV3 to 0.0.28.
- Update transformers to 5.5.
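The idea behind rotation-based KV cache quantization (common to the TurboQuant-inspired llama.cpp change and ik_llama.cpp's Hadamard KV rotation) can be shown with a toy example: multiplying by an orthonormal Hadamard matrix spreads an outlier's energy across all dimensions, so a low-bit round-to-nearest quantizer wastes less of its range. This is a stdlib-only illustration with made-up numbers, not either project's implementation:

```python
import math

# 4x4 Hadamard matrix scaled by 1/2 so it is orthonormal; it is also
# symmetric, so it is its own inverse (rotating twice undoes the rotation).
H = [[0.5 * s for s in row] for row in
     [[1,  1,  1,  1],
      [1, -1,  1, -1],
      [1,  1, -1, -1],
      [1, -1, -1,  1]]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def quantize(v, bits=4):
    """Symmetric round-to-nearest quantization with one scale per vector."""
    scale = max(abs(x) for x in v) / (2 ** (bits - 1) - 1)
    return [round(x / scale) * scale for x in v]

# A KV-cache-like vector with one large outlier: the outlier forces a big
# quantization step, destroying precision for the small entries.
v = [100.0, 10.0, -20.0, 5.0]

direct = quantize(v)
# Rotate, quantize, rotate back: the outlier's energy is spread over all
# four dimensions before quantizing, then the rotation is undone exactly.
rotated = matvec(H, quantize(matvec(H, v)))

err = lambda a: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, v)))
print(err(direct) > err(rotated))   # True — rotation reduced the error
```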
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU, older driver: use cuda12.4.
  - NVIDIA GPU, newer driver (nvidia-smi reports CUDA Version >= 13.1): use cuda13.1.
  - AMD/Intel GPU (Vulkan): use vulkan.
  - AMD GPU (ROCm): use rocm.
  - CPU only: use cpu.
- Mac:
  - Apple Silicon: use macos-arm64.
  - Intel: use macos-x86_64.
- The textgen-portable-ik builds are for the ik_llama.cpp backend.
Updating a portable install:
- Download and extract the latest version.
- Replace the user_data folder with the one in your existing install. All your settings and models will be moved.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

```
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/   <-- shared by both installs
```
https://github.com/oobabooga/text-generation-webui/releases/tag/v4.3.2
