🔥 LMCache/LMCache
Supercharge Your LLM with the Fastest KV Cache Layer
| Blog | Documentation | Join Slack | Interest Form | Roadmap
Summary
LMCache is an LLM serving engine extension that reduces time-to-first-token (TTFT) and increases throughput, especially in long-context scenarios. It stores the KV caches of reusable text across the datacenter (GPU, CPU, disk, and even S3), using a range of acceleration techniques (zero-copy CPU transfer, NIXL, GDS, and more). LMCache can reuse the KV cache of any repeated text (not necessarily a prefix) in any serving engine instance, saving precious GPU cycles and reducing user response delay.
By combining LMCache with vLLM, developers achieve 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.
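The core idea can be illustrated with a toy sketch: chunk the token stream, hash each chunk, and serve previously computed KV entries from a store instead of recomputing them during prefill. The names below (`CHUNK`, `kv_store`, `compute_kv`) are hypothetical; the real LMCache stores attention tensors across GPU/CPU/disk tiers and handles far more than this sketch shows.

```python
# Toy illustration of a KV cache layer: chunk tokens, hash each chunk,
# and reuse cached "KV" entries on a hit instead of recomputing them.
# All names here are illustrative, not LMCache's actual API.
import hashlib

CHUNK = 4  # tokens per cache chunk (illustrative)
kv_store: dict[str, list[int]] = {}  # chunk hash -> precomputed "KV" data
recomputed_chunks = 0

def compute_kv(tokens: list[int]) -> list[int]:
    """Stand-in for the expensive prefill step that produces KV entries."""
    global recomputed_chunks
    recomputed_chunks += 1
    return [t * 2 for t in tokens]  # placeholder for attention K/V tensors

def prefill(tokens: list[int]) -> list[int]:
    """Fetch each chunk's KV from the store, computing only on a miss."""
    kv: list[int] = []
    for i in range(0, len(tokens), CHUNK):
        chunk = tokens[i:i + CHUNK]
        key = hashlib.sha256(bytes(chunk)).hexdigest()
        if key not in kv_store:      # miss: pay the compute cost once
            kv_store[key] = compute_kv(chunk)
        kv.extend(kv_store[key])     # hit: reuse, saving GPU cycles
    return kv

prefill(list(range(12)))                      # first request: 3 chunks computed
prefill(list(range(12)) + [99, 98, 97, 96])   # reuses 3 chunks, computes 1
print(recomputed_chunks)  # 4, not 7: shared chunks were served from cache
```

In a multi-round QA or RAG workload, the shared context behaves like the repeated chunks above, which is where the TTFT savings come from.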
LMCache is used, integrated, or referenced across a growing ecosystem of LLM serving platforms, infrastructure providers, and open-source projects:
For more details, please check our Ray Summit talk and technical report.
Features
- 🔥 Integration with vLLM v1, including:
  - High-performance CPU KVCache offloading
  - Disaggregated prefill
  - P2P KVCache sharing
- Integration with SGLang for KV cache offloading
- Storage backends: CPU, Disk, NIXL
- Installation through pip, compatible with the latest vLLM
Installation
To use LMCache, simply install lmcache from your package manager, e.g. pip:
pip install lmcache
Works on Linux with NVIDIA GPUs.
More detailed installation instructions are available in the docs, particularly if you are not using the latest stable version of vLLM or are using another serving engine with different dependencies. Fixes for "undefined symbol" errors and torch version mismatches are also covered in the documentation.
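As a sketch of a typical setup, the snippet below installs LMCache and launches vLLM with LMCache as the KV connector. The exact flags, environment variables, and connector name can differ across vLLM/LMCache versions, so treat this as illustrative and defer to the docs for your setup.

```shell
# Install LMCache alongside vLLM (Linux + NVIDIA GPU).
pip install lmcache vllm

# Launch vLLM with LMCache as the KV cache connector. The connector name,
# config schema, and env vars below follow the LMCache quickstart examples
# and may differ in your version -- check the documentation.
LMCACHE_CHUNK_SIZE=256 LMCACHE_LOCAL_CPU=True \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```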
Getting started
The best way to get started is to check out the Quickstart Examples in the docs.
Documentation
Check out the LMCache documentation which is available online.
We also post regularly in LMCache blogs.
Examples
Go hands-on with our examples, demonstrating how to address different use cases with LMCache.
Interested in Connecting?
Fill out the interest form, sign up for our newsletter, join LMCache slack, or drop an email, and our team will reach out to you!
Community meeting
The LMCache community meeting is hosted bi-weekly over Zoom. All are welcome to join!
Meetings are held on Tuesdays at 9:00 AM PT – Add to Google Calendar
We keep notes from each meeting on this document for summaries of standups, discussion, and action items.
Recordings of meetings are available on the YouTube LMCache channel.
Contributing
We welcome and value all contributions and collaborations. Please check out the Contributing Guide to learn how to contribute.
We continually update the [Onboarding] Welcoming contributors issue with good first issues!
Citation
If you use LMCache for your research, please cite our papers:
@inproceedings{liu2024cachegen,
  title={Cachegen: Kv cache compression and streaming for fast large language model serving},
  author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
  booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
  pages={38--56},
  year={2024}
}

@article{cheng2024large,
  title={Do Large Language Models Need a Content Delivery Network?},
  author={Cheng, Yihua and Du, Kuntai and Yao, Jiayi and Jiang, Junchen},
  journal={arXiv preprint arXiv:2409.13761},
  year={2024}
}

@inproceedings{10.1145/3689031.3696098,
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  title={CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion},
  year={2025},
  url={https://doi.org/10.1145/3689031.3696098},
  doi={10.1145/3689031.3696098},
  booktitle={Proceedings of the Twentieth European Conference on Computer Systems},
  pages={94--109}
}

@article{cheng2025lmcache,
  title={LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference},
  author={Cheng, Yihua and Liu, Yuhan and Yao, Jiayi and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Du, Kuntai and Jiang, Junchen},
  journal={arXiv preprint arXiv:2510.09665},
  year={2025}
}
Socials
Linkedin | Twitter | Youtube
License
The LMCache codebase is licensed under Apache License 2.0. See the LICENSE file for details.