Top LLM Gateways That Support Semantic Caching in 2026
Let me ask you something. How many times a day do your users ask your LLM app the same question, worded differently?
"What is RAG?" and "Explain retrieval augmented generation to me" are the same question. You know it. I know it. But your LLM provider does not care. It charges you for both. Twice the tokens, twice the latency, same answer.
This is where semantic caching comes in, and if you have not explored it yet, let me walk you through it before we look at the tools.
TL;DR: Semantic caching matches LLM prompts by meaning, not exact strings, so rephrased questions return cached responses instead of burning tokens. I compared four tools that support it in 2026: Bifrost (fastest, most complete caching), LiteLLM (widest provider support), Kong AI Gateway (enterprise plugin), and GPTCache (standalone library). The right pick depends on your stack and what you need beyond caching.
If you know what an LLM API call costs and have dealt with caching before, you are in the right place.
What Is Semantic Caching, and Why Should You Care?
Think of it like a library that remembers not just the exact title of a book you asked for, but also understands that "that novel about the kid wizard at boarding school" means you want Harry Potter.
Traditional caching matches strings exactly. "What is RAG?" and "what is RAG?" might match (if you lowercase). But "Explain retrieval augmented generation" will miss the cache entirely, even though it means the same thing.
Semantic caching converts your prompts into vector embeddings and compares meaning. If a new prompt is close enough in meaning to something already cached, you get the cached response back. No API call. No tokens billed. Near-zero latency.
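Stripped to its core, the lookup logic is just a similarity comparison against stored embeddings. Here is a toy sketch (hand-written vectors standing in for a real embedding model, so the mechanics are visible):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy embeddings standing in for a real model like text-embedding-3-small.
cache = {
    (0.9, 0.1, 0.4): "RAG combines retrieval with generation...",
}

def lookup(query_embedding, threshold=0.8):
    for cached_embedding, response in cache.items():
        if cosine(query_embedding, cached_embedding) >= threshold:
            return response  # cache hit: no provider call, no tokens billed
    return None  # cache miss: call the LLM, then store the result

# A rephrased question embeds close to the cached one and hits.
print(lookup((0.88, 0.12, 0.41)) is not None)  # True
```

A real gateway does exactly this, just with model-generated embeddings and a proper vector index instead of a linear scan.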
Here is what this means for your bill: if even 30% of your queries are semantically similar to previous ones, you are looking at a meaningful reduction in API costs. And your users get faster responses.
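Back-of-envelope, with made-up numbers (your query volume and per-call cost will differ):

```python
daily_queries = 100_000
cost_per_call = 0.002      # USD, hypothetical average per request
semantic_hit_rate = 0.30   # 30% of queries match a cached answer

baseline = daily_queries * cost_per_call
with_cache = daily_queries * (1 - semantic_hit_rate) * cost_per_call
print(f"daily: ${baseline:.0f} -> ${with_cache:.0f}, saving ${baseline - with_cache:.0f}")
```

At a 30% hit rate, cost falls by the same 30%, and every hit also skips the provider's round-trip latency.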
Now, let me walk you through the gateways that actually support this in 2026.
1. Bifrost
Bifrost is an open-source LLM gateway written in Go. If you are running a production app and latency matters to you, here is the key number: 11 microseconds of overhead per request, handling 5,000 requests per second on a single instance.
But the part I want to highlight is how it handles semantic caching. It is not an afterthought here. It is a proper, dual-layer system.
Let me explain what dual-layer means:
Think of it as two bouncers at the door. The first one checks if your prompt is an exact match to something cached (fast hash lookup, no embedding cost). Only if that misses does the second bouncer kick in, doing a vector similarity search across your vector store to find semantically equivalent prompts.
This means you get the speed of exact matching for repeated queries and the intelligence of semantic matching for rephrased ones. You are not paying embedding costs when you do not need to.
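A minimal sketch of the dual-layer flow. The hashing and the toy similarity function here are illustrative, not Bifrost's internals; the point is the ordering: a cheap exact check first, embeddings only on a miss.

```python
import hashlib

exact_cache = {}     # layer 1: hash of the exact prompt -> response
semantic_cache = []  # layer 2: (embedding, response) pairs

def embed(text):
    # Stand-in for a real embedding call; real systems pay per embedding.
    return set(text.lower().split())

def similarity(a, b):
    # Word-overlap ratio as a toy stand-in for cosine similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

def lookup(prompt, threshold=0.5):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:                # layer 1: no embedding cost
        return exact_cache[key]
    query = embed(prompt)                 # layer 2: embed only on exact miss
    for emb, response in semantic_cache:
        if similarity(query, emb) >= threshold:
            return response
    return None

def store(prompt, response):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    exact_cache[key] = response
    semantic_cache.append((embed(prompt), response))

store("what is retrieval augmented generation", "RAG pairs a retriever with an LLM...")
print(lookup("what is retrieval augmented generation"))       # exact hit, layer 1
print(lookup("explain what retrieval augmented generation is"))  # semantic hit, layer 2
```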
Vector store support: Weaviate, Redis, and Qdrant.
Here is what makes it stand out:
- Per-request TTL and similarity threshold overrides via HTTP headers (you control caching behaviour per call)
- Model and provider isolation, so your GPT-4o cache does not mix with your Claude cache
- Streaming response caching with proper chunk ordering
- Cache type control to choose exact vs semantic per request
- No-store control to skip writing to cache while still reading from it
- ConversationHistoryThreshold (default 3) that skips caching for longer conversations to avoid false positive matches
- Fallback routing if your primary provider goes down
- Budget and rate limits through the governance layer
- Built-in observability for monitoring cache hit rates
Important detail you should know: The cache key header (x-bf-cache-key) is mandatory. If you do not send it, requests bypass caching entirely. This is intentional. It gives you precise control over what gets cached and what does not.
Let me walk you through the setup:
First, get Bifrost running. One command:
```bash
npx -y @maximhq/bifrost
```
Follow the setup guide if you need help with the initial configuration.
Next, configure your vector store in config.json:
```json
{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "localhost:8080",
      "scheme": "http"
    }
  }
}
```
Then enable the semantic cache plugin:
```json
{
  "plugins": [
    {
      "enabled": true,
      "name": "semantic_cache",
      "config": {
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "ttl": "5m",
        "threshold": 0.8,
        "conversation_history_threshold": 3,
        "cache_by_model": true,
        "cache_by_provider": true,
        "cleanup_on_shutdown": true
      }
    }
  ]
}
```
Now send a request with caching enabled:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-cache-key: session-123" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Explain what RAG is"}]
  }'
```
And here is the fun part. Send a differently worded version of the same question with custom threshold and TTL:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-cache-key: session-123" \
  -H "x-bf-cache-ttl: 30s" \
  -H "x-bf-cache-threshold: 0.9" \
  -H "x-bf-cache-type: semantic" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is retrieval augmented generation?"}]
  }'
```
If the similarity crosses your threshold, that second call returns the cached response instantly. No provider call. You can verify cache behaviour using the benchmarking tools and the observability dashboard.
Check the full docs for every configuration option and edge case.
2. LiteLLM
LiteLLM is a Python-based proxy that supports 100+ LLM providers. If you are already using it for multi-provider routing, it does offer semantic caching through Redis or Qdrant.
How it works:
You set your cache mode to redis-semantic or qdrant-semantic. LiteLLM converts prompts into embeddings and runs similarity search against previously cached queries.
Vector store support: Redis (with the RediSearch module) and Qdrant.
What you get:
- Seven cache backends total (in-memory, disk, Redis, S3, GCS, Qdrant, Redis semantic)
- Dual-cache design with L1 in-memory and L2 Redis tiers for exact caching
- Per-request cache control through headers and namespaces
- Configurable similarity threshold
Setup example (Redis semantic):
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: "redis-semantic"
    similarity_threshold: 0.8
    redis_semantic_cache_embedding_model: "text-embedding-3-small"
    redis_semantic_cache_use_async: true
```
Where it falls short: LiteLLM is written in Python and adds roughly 8ms of latency per request at the gateway level. The semantic caching works for basic use cases, but there is no streaming response caching, no per-model cache isolation, and no dual-layer approach (exact + semantic in the same pipeline). If you need those, something like Bifrost with its dual-layer caching is worth evaluating.
3. Kong AI Gateway
If you are already running Kong in production, you will be glad to know it added LLM support starting with version 3.6. The AI Semantic Cache plugin arrived in Kong Gateway 3.8.
How it works:
The plugin queries a Redis vector database for semantically similar requests. If a match crosses the configured threshold, the cached response is served without calling the LLM provider.
Vector store support: Redis (with vector search capabilities).
What you get:
- Plugin-based architecture that fits into existing Kong deployments
- Support for OpenAI and Mistral embedding models
- Standard Kong plugin configuration via decK, Kubernetes CRDs, or Terraform
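If you manage Kong declaratively, the plugin wiring looks roughly like this. Treat it as a sketch: the field names reflect my reading of the 3.8-era plugin schema, and the Redis host is a placeholder, so check Kong's plugin reference before relying on it.

```yaml
# decK declarative config sketch for the AI Semantic Cache plugin.
# Field names are assumptions based on the 3.8-era schema; verify them.
plugins:
  - name: ai-semantic-cache
    config:
      embeddings:
        model:
          provider: openai
          name: text-embedding-3-small
      vectordb:
        strategy: redis
        dimensions: 1536
        distance_metric: cosine
        threshold: 0.8
        redis:
          host: redis.example.internal  # placeholder host
          port: 6379
```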
Here is the catch: Kong is a full API gateway platform. If you are setting it up just for LLM caching, you are taking on a lot of infrastructure complexity for a single feature. There is no dual-layer (exact + semantic) approach. And the AI plugins require Kong Gateway Enterprise or Kong Konnect, so it is not open-source in the way Bifrost or LiteLLM are.
4. GPTCache (Standalone Library)
GPTCache is not a gateway at all. It is an open-source Python library from Zilliz, purpose-built for semantic caching. I am including it because if you already have a setup and want to add caching, it might be what you need.
How it works:
GPTCache converts queries into embeddings and stores them in a vector store. On each new query, it does a similarity search and returns cached responses for matches.
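Because it lives in your application code rather than in front of it, the integration pattern is a wrapper around your LLM call. Here is a toy decorator illustrating that pattern (this is not GPTCache's actual API; its real adapters and data managers do this with proper embeddings and vector stores):

```python
import functools

def semantic_cached(embed, similarity, threshold=0.8):
    """Wrap an LLM-calling function with a library-level semantic cache."""
    store = []  # (embedding, response) pairs kept in-process

    def decorator(llm_call):
        @functools.wraps(llm_call)
        def wrapper(prompt):
            query = embed(prompt)
            for emb, response in store:
                if similarity(query, emb) >= threshold:
                    return response          # hit: skip the provider call
            response = llm_call(prompt)      # miss: pay for the real call
            store.append((query, response))
            return response
        return wrapper
    return decorator

# Toy embedding + similarity so the example runs without a model.
def embed(text):
    return set(text.lower().split())

def overlap(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

calls = 0

@semantic_cached(embed, overlap, threshold=0.6)
def ask_llm(prompt):
    global calls
    calls += 1
    return f"answer to: {prompt}"

ask_llm("what is retrieval augmented generation")
ask_llm("explain what retrieval augmented generation is")  # similar enough: served from cache
print(calls)  # prints 1: only one real call was made
```

The upside is zero extra infrastructure; the downside, as the trade-off section below notes, is that routing, failover, and per-request controls are yours to build.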
Vector store support: Milvus, Zilliz Cloud, FAISS, and others.
What you get:
- Direct integration with LangChain and LlamaIndex
- Multiple embedding model options
- SQLite-based storage with extensible interfaces
The trade-off: Since GPTCache is a library, you need to wire it into your application code yourself. There is no HTTP proxy, no provider routing, no failover, no request-level controls via headers. If you want a full gateway experience with semantic caching, fallbacks, and governance built in, you will need to pair GPTCache with another tool, or use a gateway like Bifrost that ships all of this natively.
Quick Comparison
| Feature | Bifrost | LiteLLM | Kong AI Gateway | GPTCache |
|---|---|---|---|---|
| Type | LLM Gateway | LLM Proxy | API Gateway | Library |
| Language | Go | Python | Lua/Go | Python |
| Dual-layer caching (exact + semantic) | Yes | No (separate modes) | No | No |
| Streaming response caching | Yes | No | No | No |
| Per-request TTL/threshold overrides | Yes (headers) | Partial | No | No |
| Model/provider cache isolation | Yes | No | No | N/A |
| Vector store options | Weaviate, Redis, Qdrant | Redis, Qdrant | Redis | Milvus, FAISS |
| Open source | Yes | Yes (core) | Enterprise plugin | Yes |
| Gateway overhead | 11 microseconds | ~8ms | Varies | N/A |
Trade-offs to Consider
Before I give you my recommendation, here is an honest look at the downsides of each option. No tool is perfect, and you should know what you are signing up for.
- Bifrost: a newer project with a smaller community than LiteLLM or Kong. Self-hosted only, so you own the infrastructure and operations. If you need a managed service, this is not it (yet).
- LiteLLM: the Python runtime adds ~8ms of gateway latency, which adds up at scale. But if you need to route across 100+ providers, nothing else comes close on breadth.
- Kong AI Gateway: you are deploying a full API gateway platform for what might be a single feature. The semantic caching plugin requires an Enterprise or Konnect license. A heavy lift if LLM caching is your primary goal.
- GPTCache: a library, not a gateway. No HTTP proxy, no routing, no failover. You wire it into your own code and handle everything else yourself.
Knowing these trade-offs upfront saves you from hitting surprises three months into production.
So, Which One Should You Pick?
If you are building something new and want semantic caching that actually covers the edge cases (streaming, model isolation, per-request control, dual-layer matching), Bifrost gives you the most complete implementation. The setup takes under a minute and you get observability and governance on top.
If you are already invested in Kong, the AI Semantic Cache plugin is reasonable. You know your infrastructure.
If you are using LiteLLM for its multi-provider support and need basic caching, its Redis semantic mode will do the job.
If you want a lightweight library to bolt onto an existing app, GPTCache is there for that.
For most teams I have seen, the pain starts when you need fine-grained cache control in production. Stale caches, mixed model responses, streaming issues. Starting with a gateway that handles these from day one saves you from stitching together workarounds later. You can get Bifrost running with a single command and start caching immediately. The GitHub repo has everything you need.
Further Reading
- Bifrost Semantic Caching docs: full configuration reference for dual-layer caching
- Vector Store setup guide: Weaviate, Redis, and Qdrant integration details
- Bifrost Quickstart: get running in under a minute
- Governance and Budget Limits: control spend alongside caching
- Benchmarking tools: measure cache hit rates and latency in your own environment