I was burning through AI tokens without realizing it. Here's how I fixed it.
I've been using Claude Code and Codex daily for months. They're some of the best programming tools I've tried. But there's something nobody tells you when you start: context runs out fast, and the cost grows exponentially.
The real problem isn't the message you're sending
When you're 50 messages into a session and you send message 51, your CLI doesn't just send that message. It sends all 51. The entire conversation, from the beginning, with every single request.
On top of that, Claude Code's system prompt is 13,000 characters — also sent with every message. Every command result the AI has run, every file it read, every search it performed — all of it is in the history, resent again and again.
In a real session, message 51 can end up sending 85,000 characters to the API. For a single message.
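To see how the payload balloons, here is a back-of-the-envelope sketch. The 1,400-character average message size is an assumption chosen to land near the 85,000-character figure above, not a measured value.

```typescript
// Illustrative arithmetic, not actual CLI internals: every turn resends
// the system prompt plus the entire prior history.
const SYSTEM_PROMPT_CHARS = 13_000;

// Characters the client sends to the API, given each message's size so far
function payloadChars(messageSizes: number[]): number {
  const history = messageSizes.reduce((sum, size) => sum + size, 0);
  return SYSTEM_PROMPT_CHARS + history;
}

// 51 messages averaging ~1,400 chars each (tool output included)
const sizes = Array.from({ length: 51 }, () => 1_400);
console.log(payloadChars(sizes)); // 84,400 chars for a single turn
```

Note that the payload grows linearly per turn, so the cumulative characters sent over a whole session grow quadratically with the number of messages.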
Why existing tools don't fix this
There's a very popular tool for this problem: RTK (Rust Token Killer), with over 16,000 GitHub stars. It does exactly what it promises: it works as a shell wrapper that intercepts the stdout of each command before it enters the context. When the AI runs git diff, RTK filters the output before the result is stored in the history.
The problem isn't RTK — it's the scope of that approach.
Once a command result has entered the history, RTK can't touch it anymore. And on message 51, those 50 previous messages — with all their results, logs, file reads — are resent in full to the API. RTK has no visibility into the accumulated history.
In numbers: in a 50-turn session with 150,000 total tokens, RTK saves approximately 1.6%. It can only act on the current turn.
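One set of assumed numbers that reproduces that figure (the per-turn output size and filter ratio here are my guesses, not RTK benchmarks):

```typescript
// Assumed sizes: RTK filters only the current turn's command output,
// while the accumulated history is resent untouched.
const sessionTokens = 150_000;      // 50-turn session, total context resent
const currentTurnOutput = 3_000;    // tokens of stdout in the latest turn
const rtkFilterRatio = 0.8;         // suppose RTK strips 80% of that output

const saved = currentTurnOutput * rtkFilterRatio;  // 2,400 tokens
const savedPct = (saved / sessionTokens) * 100;
console.log(savedPct.toFixed(1) + "%"); // "1.6%"
```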
What I built
Squeezr is a local HTTP proxy that intercepts each request before it reaches the API. It operates at a different level than RTK: not on the stdout of a single command, but on the complete HTTP request — it sees and compresses the entire conversation on every send.
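The shape of such a proxy can be sketched in a few lines of Node. This is my simplification, not Squeezr's actual code, and compressBody is a stub standing in for the real compression pipeline:

```typescript
import http from "node:http";

// Minimal proxy sketch: receive the CLI's request, rewrite the body,
// forward everything else untouched.
function compressBody(body: string): string {
  // A real implementation would parse the JSON payload and shrink the
  // system prompt, tool results, and older messages before forwarding.
  return body;
}

const proxy = http.createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const compressed = compressBody(body);
    const upstream = http.request(
      {
        host: "api.anthropic.com", // or whichever provider the CLI targets
        path: req.url,
        method: req.method,
        headers: { ...req.headers, "content-length": Buffer.byteLength(compressed) },
      },
      (apiRes) => apiRes.pipe(res)
    );
    upstream.end(compressed);
  });
});

// proxy.listen(8787) would start it; the CLI is then pointed at
// http://localhost:8787 instead of the provider's API endpoint.
```

Because the proxy sees the full request body, it can act on the entire conversation, not just one command's output.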
The system prompt is compressed once and cached. From 13,000 chars down to ~650. On the next request, and the one after, it comes straight from cache — no recompression.
Command and tool results are filtered before they accumulate in the history. When the AI runs npm test and gets 200 lines back, Squeezr extracts only the failing tests. When it reads a file, it keeps what's relevant. When it searches, it compacts the results. Git commands, Docker, kubectl, compilers, linters — each has its own specific pattern. And unlike RTK, Squeezr also compresses file reads and search results, not just bash output.
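As an illustration of the deterministic-pattern idea, here is a toy filter that keeps only failure lines from a test run. The regex is mine, not one of Squeezr's actual patterns:

```typescript
// Keep only the lines of a test run that indicate failures.
function filterTestOutput(stdout: string): string {
  return stdout
    .split("\n")
    .filter((line) => /✕|FAIL|Error:|failed/i.test(line))
    .join("\n");
}

const raw = [
  "✓ renders header (12 ms)",
  "✓ parses config (3 ms)",
  "✕ saves session (41 ms)",
  "Error: expected 200, got 500",
  "Tests: 1 failed, 2 passed",
].join("\n");

console.log(filterTestOutput(raw));
// ✕ saves session (41 ms)
// Error: expected 200, got 500
// Tests: 1 failed, 2 passed
```

The passing lines carry almost no information the AI needs; the failures and the summary line are what drive the next action.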
The full history is compressed with every request. Older messages are summarized automatically. Message 51 doesn't resend 50 full messages — it sends 48 compressed ones and the last 3 intact.
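A minimal sketch of that rolling compression, with assumed message shapes and a stand-in summarize function (the real summarizer is pattern-based or AI-backed):

```typescript
type Msg = { role: string; content: string };

const KEEP_INTACT = 3; // most recent messages sent verbatim

// Replace everything older than the last KEEP_INTACT messages with summaries.
function compressHistory(history: Msg[], summarize: (text: string) => string): Msg[] {
  const cut = Math.max(0, history.length - KEEP_INTACT);
  return history.map((m, i) =>
    i < cut ? { ...m, content: summarize(m.content) } : m
  );
}

// Example: 51 messages in, 48 summarized, last 3 untouched.
const history: Msg[] = Array.from({ length: 51 }, (_, i) => ({
  role: i % 2 ? "assistant" : "user",
  content: `message ${i} ... lots of tool output ...`,
}));
const out = compressHistory(history, (t) => t.slice(0, 20) + " [summary]");
console.log(out.filter((m) => m.content.endsWith("[summary]")).length); // 48
```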
The result on that same 85,000 char example: 25,000 chars. 71% less, on every message. In long sessions, cumulative savings reach 89%.
No quality loss
Compression is lossless. All original content is stored locally. If the AI needs more detail from something that was compressed, it calls squeezr_expand() and gets the full original back instantly — no cost, no API call.
The AI gets the same information. Without the filler.
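Conceptually, lossless compression plus expand-on-demand is just a local store of originals keyed by block id. A sketch with hypothetical names (only squeezr_expand() is the tool's real interface):

```typescript
// Originals are kept locally, so any compressed block can be recovered
// without an API round trip.
const originals = new Map<string, string>();
let nextId = 0;

function compress(text: string, summarize: (t: string) => string) {
  const id = `blk_${nextId++}`;
  originals.set(id, text); // full text never leaves the machine
  return { id, summary: summarize(text) };
}

function expand(id: string): string | undefined {
  return originals.get(id); // full original, no cost, no API call
}
```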
AI compression uses the cheapest model you already have — no extra cost
When a block is too long for deterministic patterns, Squeezr uses an AI model to summarize it — always the cheapest one from the provider you're already using: Haiku if you're on Claude, GPT-4o-mini if you're on Codex, Flash if you're on Gemini. And if you work with local models through Ollama or LM Studio, it uses local models too. No extra API keys, no additional cost.
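That routing amounts to a small provider-to-model table. The mapping below is my paraphrase of the paragraph above (model names abbreviated as the article gives them), not Squeezr's configuration:

```typescript
// Cheapest summarizer per provider; "local" means whatever model is
// already loaded in Ollama or LM Studio.
const cheapModel: Record<string, string> = {
  anthropic: "haiku",
  openai: "gpt-4o-mini",
  google: "flash",
  ollama: "local",
};

function summarizerFor(provider: string): string {
  return cheapModel[provider] ?? "local";
}
```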
What changed in practice
Sessions last much longer. The AI stays on track because the context isn't filled with noise. And token spending dropped considerably:
Running squeezr gain prints the savings report:

```
Squeezr — Token Savings
-----------------------------------
Requests processed:   33
Saved chars:          6,987,655
Total tokens saved:   1,912,840
Tool saving:          94.67%
Context reduction:    78%

By Tool
Read (161x):      -83.8%
WebFetch (25x):   -60%
Grep (15x):       -66.4%
```
squeezr discover shows you exactly which patterns are saving the most in your specific workflow. For me, vitest results and git diffs are the biggest wins.
How to try it
```
npm install -g squeezr-ai
squeezr setup
squeezr start
```
Works today with Claude Code, Codex, Aider, and Gemini CLI. Cursor support is coming soon.
MIT. https://squeezr.es
If you try it, squeezr gain will tell you exactly how much you're saving.