Top LLM Gateways That Support Semantic Caching in 2026
Let me ask you something. How many times a day do your users ask your LLM app the same question, worded differently?
"What is RAG?" and "Explain retrieval augmented generation to me" are the same question. You know it. I know it. But your LLM provider does not care. It charges you for both. Twice the tokens, twice the latency, same answer.
This is where semantic caching comes in, and if you have not explored it yet, let me walk you through it before we look at the tools.
TL;DR: Semantic caching matches LLM prompts by meaning, not exact strings, so rephrased questions return cached responses instead of burning tokens. I compared four tools that support it in 2026: Bifrost (fastest, most complete caching), LiteLLM (widest provider support), Kong AI Gateway (enterprise plugin), and GPTCache (standalone library). The right pick depends on your stack and what you need beyond caching.
If you know what an LLM API call costs and have dealt with caching before, you are in the right place.
What Is Semantic Caching, and Why Should You Care?
Think of it like a library that remembers not just the exact title of a book you asked for, but also understands that "that novel about the kid wizard at boarding school" means you want Harry Potter.
Traditional caching matches strings exactly. "What is RAG?" and "what is RAG?" might match (if you lowercase). But "Explain retrieval augmented generation" will miss the cache entirely, even though it means the same thing.
Semantic caching converts your prompts into vector embeddings and compares meaning. If a new prompt is close enough in meaning to something already cached, you get the cached response back. No API call. No tokens billed. Near-zero latency.
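Stripped to its core, the lookup logic is just a similarity comparison against stored embeddings. Here is a toy sketch (hand-written vectors standing in for a real embedding model, so the mechanics are visible):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy embeddings standing in for a real model like text-embedding-3-small.
cache = {
    (0.9, 0.1, 0.4): "RAG combines retrieval with generation...",
}

def lookup(query_embedding, threshold=0.8):
    for cached_embedding, response in cache.items():
        if cosine(query_embedding, cached_embedding) >= threshold:
            return response  # cache hit: no provider call, no tokens billed
    return None  # cache miss: call the LLM, then store the result

# A rephrased question embeds close to the cached one and hits.
print(lookup((0.88, 0.12, 0.41)) is not None)  # True
```

A real gateway does exactly this, just with model-generated embeddings and a proper vector index instead of a linear scan.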
Here is what this means for your bill: if even 30% of your queries are semantically similar to previous ones, you are looking at a meaningful reduction in API costs. And your users get faster responses.
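Back-of-envelope, with made-up numbers (your query volume and per-call cost will differ):

```python
daily_queries = 100_000
cost_per_call = 0.002      # USD, hypothetical average per request
semantic_hit_rate = 0.30   # 30% of queries match a cached answer

baseline = daily_queries * cost_per_call
with_cache = daily_queries * (1 - semantic_hit_rate) * cost_per_call
print(f"daily: ${baseline:.0f} -> ${with_cache:.0f}, saving ${baseline - with_cache:.0f}")
```

At a 30% hit rate, cost falls by the same 30%, and every hit also skips the provider's round-trip latency.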
Now, let me walk you through the gateways that actually support this in 2026.
1. Bifrost
Bifrost is an open-source LLM gateway written in Go. If you are running a production app and latency matters to you, here is the key number: 11 microseconds of overhead per request, handling 5,000 requests per second on a single instance.
But the part I want to highlight is how it handles semantic caching. It is not an afterthought here. It is a proper, dual-layer system.
Let me explain what dual-layer means:
Think of it as two bouncers at the door. The first one checks if your prompt is an exact match to something cached (fast hash lookup, no embedding cost). Only if that misses does the second bouncer kick in, doing a vector similarity search across your vector store to find semantically equivalent prompts.
This means you get the speed of exact matching for repeated queries and the intelligence of semantic matching for rephrased ones. You are not paying embedding costs when you do not need to.
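A minimal sketch of the dual-layer flow. The hashing and the toy similarity function here are illustrative, not Bifrost's internals; the point is the ordering: a cheap exact check first, embeddings only on a miss.

```python
import hashlib

exact_cache = {}     # layer 1: hash of the exact prompt -> response
semantic_cache = []  # layer 2: (embedding, response) pairs

def embed(text):
    # Stand-in for a real embedding call; real systems pay per embedding.
    return set(text.lower().split())

def similarity(a, b):
    # Word-overlap ratio as a toy stand-in for cosine similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

def lookup(prompt, threshold=0.5):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:                # layer 1: no embedding cost
        return exact_cache[key]
    query = embed(prompt)                 # layer 2: embed only on exact miss
    for emb, response in semantic_cache:
        if similarity(query, emb) >= threshold:
            return response
    return None

def store(prompt, response):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    exact_cache[key] = response
    semantic_cache.append((embed(prompt), response))

store("what is retrieval augmented generation", "RAG pairs a retriever with an LLM...")
print(lookup("what is retrieval augmented generation"))       # exact hit, layer 1
print(lookup("explain what retrieval augmented generation is"))  # semantic hit, layer 2
```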
Vector store support: Weaviate, Redis, and Qdrant.
Here is what makes it stand out:
- Per-request TTL and similarity threshold overrides via HTTP headers (you control caching behaviour per call)
- Model and provider isolation, so your GPT-4o cache does not mix with your Claude cache
- Streaming response caching with proper chunk ordering
- Cache type control to choose exact vs semantic per request
- No-store control to skip writing to cache while still reading from it
- ConversationHistoryThreshold (default 3) that skips caching for longer conversations to avoid false positive matches
- Fallback routing if your primary provider goes down
- Budget and rate limits through the governance layer
- Built-in observability for monitoring cache hit rates
Important detail you should know: The cache key header (x-bf-cache-key) is mandatory. If you do not send it, requests bypass caching entirely. This is intentional. It gives you precise control over what gets cached and what does not.
Let me walk you through the setup:
First, get Bifrost running. One command:
```bash
npx -y @maximhq/bifrost
```
Follow the setup guide if you need help with the initial configuration.
Next, configure your vector store in config.json:
```json
{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "localhost:8080",
      "scheme": "http"
    }
  }
}
```
Then enable the semantic cache plugin:
```json
{
  "plugins": [
    {
      "enabled": true,
      "name": "semantic_cache",
      "config": {
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "ttl": "5m",
        "threshold": 0.8,
        "conversation_history_threshold": 3,
        "cache_by_model": true,
        "cache_by_provider": true,
        "cleanup_on_shutdown": true
      }
    }
  ]
}
```
Now send a request with caching enabled:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-cache-key: session-123" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Explain what RAG is"}]
  }'
```
And here is the fun part. Send a differently worded version of the same question with custom threshold and TTL:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-cache-key: session-123" \
  -H "x-bf-cache-ttl: 30s" \
  -H "x-bf-cache-threshold: 0.9" \
  -H "x-bf-cache-type: semantic" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is retrieval augmented generation?"}]
  }'
```
If the similarity crosses your threshold, that second call returns the cached response instantly. No provider call. You can verify cache behaviour using the benchmarking tools and the observability dashboard.
Check the full docs for every configuration option and edge case.
2. LiteLLM
LiteLLM is a Python-based proxy that supports 100+ LLM providers. If you are already using it for multi-provider routing, it does offer semantic caching through Redis or Qdrant.
How it works:
You set your cache mode to redis-semantic or qdrant-semantic. LiteLLM converts prompts into embeddings and runs similarity search against previously cached queries.
Vector store support: Redis (with the RediSearch module) and Qdrant.
What you get:
- Seven cache backends total (in-memory, disk, Redis, S3, GCS, Qdrant, Redis semantic)
- Dual-cache design with L1 in-memory and L2 Redis tiers for exact caching
- Per-request cache control through headers and namespaces
- Configurable similarity threshold
Setup example (Redis semantic):
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: "redis-semantic"
    similarity_threshold: 0.8
    redis_semantic_cache_embedding_model: "text-embedding-3-small"
    redis_semantic_cache_use_async: true
```
Where it falls short: LiteLLM is written in Python and adds roughly 8ms of latency per request at the gateway level. The semantic caching works for basic use cases, but there is no streaming response caching, no per-model cache isolation, and no dual-layer approach (exact + semantic in the same pipeline). If you need those, something like Bifrost with its dual-layer caching is worth evaluating.
3. Kong AI Gateway
If you are already running Kong in production, you will be glad to know it added LLM support starting with version 3.6. The AI Semantic Cache plugin arrived in Kong Gateway 3.8.
How it works:
The plugin queries a Redis vector database for semantically similar requests. If a match crosses the configured threshold, the cached response is served without calling the LLM provider.
Vector store support: Redis (with vector search capabilities).
What you get:
- Plugin-based architecture that fits into existing Kong deployments
- Support for OpenAI and Mistral embedding models
- Standard Kong plugin configuration via decK, Kubernetes CRDs, or Terraform
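If you manage Kong declaratively, the plugin wiring looks roughly like this. Treat it as a sketch: the field names reflect my reading of the 3.8-era plugin schema, and the Redis host is a placeholder, so check Kong's plugin reference before relying on it.

```yaml
# decK declarative config sketch for the AI Semantic Cache plugin.
# Field names are assumptions based on the 3.8-era schema; verify them.
plugins:
  - name: ai-semantic-cache
    config:
      embeddings:
        model:
          provider: openai
          name: text-embedding-3-small
      vectordb:
        strategy: redis
        dimensions: 1536
        distance_metric: cosine
        threshold: 0.8
        redis:
          host: redis.example.internal  # placeholder host
          port: 6379
```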
Here is the catch: Kong is a full API gateway platform. If you are setting it up just for LLM caching, you are taking on a lot of infrastructure complexity for a single feature. There is no dual-layer (exact + semantic) approach. And the AI plugins require Kong Gateway Enterprise or Kong Konnect, so it is not open-source in the way Bifrost or LiteLLM are.
4. GPTCache (Standalone Library)
GPTCache is not a gateway at all. It is an open-source Python library from Zilliz, purpose-built for semantic caching. I am including it because if you already have a setup and want to add caching, it might be what you need.
How it works:
GPTCache converts queries into embeddings and stores them in a vector store. On each new query, it does a similarity search and returns cached responses for matches.
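Because it lives in your application code rather than in front of it, the integration pattern is a wrapper around your LLM call. Here is a toy decorator illustrating that pattern (this is not GPTCache's actual API; its real adapters and data managers do this with proper embeddings and vector stores):

```python
import functools

def semantic_cached(embed, similarity, threshold=0.8):
    """Wrap an LLM-calling function with a library-level semantic cache."""
    store = []  # (embedding, response) pairs kept in-process

    def decorator(llm_call):
        @functools.wraps(llm_call)
        def wrapper(prompt):
            query = embed(prompt)
            for emb, response in store:
                if similarity(query, emb) >= threshold:
                    return response          # hit: skip the provider call
            response = llm_call(prompt)      # miss: pay for the real call
            store.append((query, response))
            return response
        return wrapper
    return decorator

# Toy embedding + similarity so the example runs without a model.
def embed(text):
    return set(text.lower().split())

def overlap(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

calls = 0

@semantic_cached(embed, overlap, threshold=0.6)
def ask_llm(prompt):
    global calls
    calls += 1
    return f"answer to: {prompt}"

ask_llm("what is retrieval augmented generation")
ask_llm("explain what retrieval augmented generation is")  # similar enough: served from cache
print(calls)  # prints 1: only one real call was made
```

The upside is zero extra infrastructure; the downside, as the trade-off section below notes, is that routing, failover, and per-request controls are yours to build.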
Vector store support: Milvus, Zilliz Cloud, FAISS, and others.
What you get:
- Direct integration with LangChain and LlamaIndex
- Multiple embedding model options
- SQLite-based storage with extensible interfaces
The trade-off: Since GPTCache is a library, you need to wire it into your application code yourself. There is no HTTP proxy, no provider routing, no failover, no request-level controls via headers. If you want a full gateway experience with semantic caching, fallbacks, and governance built in, you will need to pair GPTCache with another tool, or use a gateway like Bifrost that ships all of this natively.
Quick Comparison
| Feature | Bifrost | LiteLLM | Kong AI Gateway | GPTCache |
|---|---|---|---|---|
| Type | LLM Gateway | LLM Proxy | API Gateway | Library |
| Language | Go | Python | Lua/Go | Python |
| Dual-layer caching (exact + semantic) | Yes | No (separate modes) | No | No |
| Streaming response caching | Yes | No | No | No |
| Per-request TTL/threshold overrides | Yes (headers) | Partial | No | No |
| Model/provider cache isolation | Yes | No | No | N/A |
| Vector store options | Weaviate, Redis, Qdrant | Redis, Qdrant | Redis | Milvus, FAISS |
| Open source | Yes | Yes (core) | Enterprise plugin | Yes |
| Gateway overhead | 11 microseconds | ~8ms | Varies | N/A |
Trade-offs to Consider
Before I give you my recommendation, here is an honest look at the downsides of each option. No tool is perfect, and you should know what you are signing up for.
- Bifrost: a newer project with a smaller community than LiteLLM or Kong. Self-hosted only, so you own the infrastructure and operations. If you need a managed service, this is not it (yet).
- LiteLLM: the Python runtime adds ~8ms of gateway latency, which adds up at scale. But if you need to route across 100+ providers, nothing else comes close on breadth.
- Kong AI Gateway: you are deploying a full API gateway platform for what might be a single feature. The semantic caching plugin requires an Enterprise or Konnect license. A heavy lift if LLM caching is your primary goal.
- GPTCache: a library, not a gateway. No HTTP proxy, no routing, no failover. You wire it into your own code and handle everything else yourself.
Knowing these trade-offs upfront saves you from hitting surprises three months into production.
So, Which One Should You Pick?
If you are building something new and want semantic caching that actually covers the edge cases (streaming, model isolation, per-request control, dual-layer matching), Bifrost gives you the most complete implementation. The setup takes under a minute and you get observability and governance on top.
If you are already invested in Kong, the AI Semantic Cache plugin is reasonable. You know your infrastructure.
If you are using LiteLLM for its multi-provider support and need basic caching, its Redis semantic mode will do the job.
If you want a lightweight library to bolt onto an existing app, GPTCache is there for that.
For most teams I have seen, the pain starts when you need fine-grained cache control in production. Stale caches, mixed model responses, streaming issues. Starting with a gateway that handles these from day one saves you from stitching together workarounds later. You can get Bifrost running with a single command and start caching immediately. The GitHub repo has everything you need.
Further Reading
- Bifrost Semantic Caching docs: full configuration reference for dual-layer caching
- Vector Store setup guide: Weaviate, Redis, and Qdrant integration details
- Bifrost Quickstart: get running in under a minute
- Governance and Budget Limits: control spend alongside caching
- Benchmarking tools: measure cache hit rates and latency in your own environment