v4.3.3 - Gemma 4 support!
Changes
- Gemma 4 support with tool-calling in the API and UI. 🆕 - v4.3.1.
- ik_llama.cpp support: Add ik_llama.cpp as a new backend through new textgen-portable-ik portable builds and a new --ik flag for full installs. ik_llama.cpp is a fork by the author of the imatrix quants, including support for new quant types, significantly more accurate KV cache quantization (via Hadamard KV cache rotation, enabled by default), and optimizations for MoE models and CPU inference.
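For intuition, the Hadamard rotation amounts to multiplying each K/V vector by an orthogonal Hadamard matrix before quantization, which spreads outlier channels evenly so they quantize with less error. A toy numpy sketch of the idea (illustrative only, not ik_llama.cpp's actual code):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Build a normalized n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

# A K/V head vector with one large outlier channel.
v = np.zeros(8)
v[3] = 10.0

H = hadamard(8)
rotated = H @ v           # energy is now spread evenly across channels
restored = H.T @ rotated  # H is orthogonal, so the rotation is lossless

print(rotated)                    # every channel has magnitude 10/sqrt(8)
print(np.allclose(restored, v))  # True
```

Because the matrix is orthogonal, the rotation is exactly invertible; accuracy is lost only in the quantization step itself, and a flat value distribution quantizes more accurately than one with outliers.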
- API: Add echo + logprobs for /v1/completions. The completions endpoint now supports the echo and logprobs parameters, returning token-level log probabilities for both prompt and generated tokens. Token IDs are also included in the output via a new top_logprobs_ids field.
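As a sketch, a request using the new parameters might look like the following. The endpoint and parameter names come from the notes above; the response structure follows the OpenAI completions convention, and all values shown are illustrative:

```python
# Example request body for the OpenAI-compatible /v1/completions endpoint
# (send it with any HTTP client, e.g. to http://127.0.0.1:5000/v1/completions).
payload = {
    "prompt": "The capital of France is",
    "max_tokens": 1,
    "echo": True,    # include prompt tokens in the logprobs output
    "logprobs": 5,   # return top-5 log probabilities per token
}

# A trimmed-down response of the shape described above (illustrative values).
response = {
    "choices": [{
        "text": "The capital of France is Paris",
        "logprobs": {
            "tokens": ["The", " capital", " of", " France", " is", " Paris"],
            "token_logprobs": [None, -2.1, -0.3, -1.5, -0.1, -0.02],
            "top_logprobs_ids": [[464], [6864], [286], [4881], [318], [6342]],
        },
    }]
}

lp = response["choices"][0]["logprobs"]
for tok, logprob in zip(lp["tokens"], lp["token_logprobs"]):
    print(f"{tok!r}: {logprob}")
```

With echo enabled, the prompt tokens appear first in the logprobs arrays (the very first token conventionally has no logprob), followed by the generated tokens.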
- Further optimize my custom gradio fork, saving up to 50 ms per UI event (button click, etc.).
- Transformers: Autodetect torch_dtype from model config instead of always forcing bfloat16/float16. The --bf16 flag still works as an override.
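The selection logic presumably reduces to something like this sketch (illustrative, not the project's actual code); torch_dtype is the standard field in a Hugging Face config.json:

```python
def pick_dtype(model_config: dict, force_bf16: bool = False) -> str:
    """Choose a load dtype: honor the --bf16 override first, then the
    model's config.json 'torch_dtype', then fall back to float16."""
    if force_bf16:
        return "bfloat16"
    return model_config.get("torch_dtype") or "float16"

print(pick_dtype({"torch_dtype": "float32"}))        # float32
print(pick_dtype({}))                                # float16
print(pick_dtype({"torch_dtype": "float32"}, True))  # bfloat16
```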
- Remove the obsolete models/config.yaml file. Instruction templates are now detected from model metadata instead of filename patterns.
- Rename "truncation length" to "context length" in the terminal log message.
Security
- Gradio fork: Fix ACL bypass via case-insensitive path matching on Windows/macOS.
- Gradio fork: Add server-side validation for Dropdown, Radio, and CheckboxGroup.
- Sanitize filenames in all prompt file operations (CWE-22). Thanks, @ffulbtech. 🆕 - v4.3.3.
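A minimal sketch of the kind of traversal check involved (illustrative, not the project's actual code): resolve the requested filename against the base directory and reject anything that escapes it.

```python
from pathlib import Path

def safe_join(base_dir: str, filename: str) -> Path:
    """Resolve filename inside base_dir, rejecting path traversal (CWE-22)."""
    base = Path(base_dir).resolve()
    target = (base / filename).resolve()
    if not target.is_relative_to(base):
        raise ValueError(f"path escapes {base_dir}: {filename}")
    return target

print(safe_join("/tmp/prompts", "note.txt"))  # a path inside the base dir
# safe_join("/tmp/prompts", "../../etc/passwd")  -> raises ValueError
```

Resolving before comparing is the important part: it normalizes `..` segments and symlinks so the containment check cannot be fooled by crafted names.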
- Fix SSRF in superbooga extensions: URLs fetched by superbooga/superboogav2 are now validated to block requests to private/internal networks.
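The validation presumably works along these lines; this is an illustrative sketch using Python's ipaddress module, not the extension's actual code:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_url_allowed(url: str) -> bool:
    """Reject URLs whose host resolves to a private/internal address (SSRF guard)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

print(is_url_allowed("http://127.0.0.1/admin"))  # False (loopback)
print(is_url_allowed("http://192.168.1.10/"))    # False (private range)
```

Checking every resolved address (not just the hostname string) matters, because an attacker-controlled DNS name can point at an internal IP.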
Bug fixes
- Fix --idle-timeout failing on encode/decode requests and not tracking parallel generation properly.
- Fix stopping string detection for chromadb/context-1 (<|return|> vs <|result|>).
- Fix Qwen3.5 MoE failing to load via ExLlamav3_HF.
- Fix ban_eos_token not working for ExLlamav3. EOS is now suppressed at the logit level.
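Logit-level suppression means setting the EOS logit to negative infinity before sampling, so the token can never be chosen regardless of sampler settings. A minimal numpy sketch (illustrative, not ExLlamav3's actual code):

```python
import numpy as np

def ban_eos(logits: np.ndarray, eos_token_id: int) -> np.ndarray:
    """Suppress the EOS token at the logit level so it can never be sampled."""
    out = logits.copy()
    out[eos_token_id] = -np.inf
    return out

logits = np.array([1.0, 3.0, 2.5, 0.5])  # pretend token 1 is EOS
banned = ban_eos(logits, eos_token_id=1)
print(int(np.argmax(banned)))  # 2 -- EOS can no longer win
```

After softmax, a -inf logit becomes exactly zero probability, which is more robust than filtering EOS out of sampler output after the fact.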
- Fix "Value: None is not in the list of choices: []" Gradio error introduced in v4.3. 🆕 - v4.3.2.
- Fix Dropdown/Radio/CheckboxGroup crash when the choices list is empty. 🆕 - v4.3.3.
- Fix API crash when parsing tool calls from non-dict JSON model output. 🆕 - v4.3.3.
- Fix llama.cpp crashing due to failing to parse the Gemma 4 template (even though we don't use llama.cpp's jinja parser). 🆕 - v4.3.2.
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@277ff5f
  - Adds Gemma 4 support
  - Adds improved KV cache quantization via activation rotation, based on TurboQuant (ggml-org/llama.cpp#21038)
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@d557d6c
- Update ExLlamaV3 to 0.0.28
- Update transformers to 5.5
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU (older driver): use cuda12.4
  - NVIDIA GPU (newer driver, nvidia-smi reports CUDA Version >= 13.1): use cuda13.1
  - AMD/Intel GPU (Vulkan): use vulkan
  - AMD GPU (ROCm): use rocm
  - CPU only: use cpu
- Mac:
  - Apple Silicon: use macos-arm64
  - Intel: use macos-x86_64
- ik_llama.cpp: use the textgen-portable-ik builds
Updating a portable install:
- Download and extract the latest version.
- Replace its user_data folder with the one from your existing install. All your settings and models will carry over.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/   <-- shared by both installs
https://github.com/oobabooga/text-generation-webui/releases/tag/v4.3.3
