New ways to balance cost and reliability in the Gemini API
Google is introducing two new inference tiers to the Gemini API, Flex and Priority, to help developers balance cost and reliability.
Apr 02, 2026
Introducing Flex and Priority inference: advanced controls for developers to optimize costs and reliability through a single, unified interface.
Hussein Hassan Harrirou
Engineering, Gemini API
Today, we are adding two new service tiers to the Gemini API: Flex and Priority. These new options give you granular control over cost and reliability through a single, unified interface.
As AI evolves from simple chat into complex, autonomous agents, developers typically have to manage two distinct types of logic:
- Background tasks: High-volume workflows like data enrichment or "thinking" processes that don't need instant responses.
- Interactive tasks: User-facing features like chatbots and copilots where high reliability is needed.
Until now, supporting both meant splitting your architecture between standard synchronous serving and the asynchronous Batch API. Flex and Priority help to bridge this gap. You can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints. This eliminates the complexity of async job management while giving you the economic and performance benefits of specialized tiers.
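Concretely, that routing can live in one small dispatch helper. Here is a minimal sketch in Python; the endpoint path and request body follow the public generateContent REST format, but the exact placement of the `service_tier` field is an assumption for illustration:

```python
# Route each request to a service tier based on how it will be used.
# The `service_tier` field placement is illustrative; check the Gemini API
# reference for the exact request schema.

API_URL = "https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent"

def build_request(prompt: str, interactive: bool, model: str = "gemini-2.5-flash"):
    """Build a generateContent URL and payload, picking the tier from the task type."""
    tier = "priority" if interactive else "flex"  # background work -> Flex
    payload = {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": tier,
    }
    return API_URL.format(model=model), payload

# A background data-enrichment job goes to the cheaper Flex tier...
url, body = build_request("Summarize this CRM record.", interactive=False)
print(body["service_tier"])  # flex

# ...while a user-facing chatbot turn goes to Priority.
url, body = build_request("Hello!", interactive=True)
print(body["service_tier"])  # priority
```

Both calls hit the same synchronous endpoint; only the tier field changes.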
Flex Inference: scale innovation for 50% less
Flex Inference is our new cost-optimized tier, designed for latency-tolerant workloads without the overhead of batch processing.
- 50% price savings: Pay half the Standard API price in exchange for lower request criticality (lower reliability and higher latency).
- Synchronous simplicity: Unlike the Batch API, Flex is a synchronous interface. You use the same familiar endpoints without managing input/output files or polling for job completion.
- Ideal use cases: Background CRM updates, large-scale research simulations, and agentic workflows where the model "browses" or "thinks" in the background.
Get started fast by simply configuring the service_tier parameter in your request:
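For example, a sketch of a Flex request body in Python (the `service_tier` parameter name comes from this post, but confirm the exact schema and field placement against the API reference):

```python
import json

def flex_payload(prompt: str) -> str:
    """JSON body for a cost-optimized Flex request (schema is a sketch)."""
    return json.dumps({
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": "flex",  # 50% of Standard pricing, latency-tolerant
    })

# POST this body to the usual synchronous generateContent endpoint, e.g.:
#   curl -X POST \
#        -H "Content-Type: application/json" \
#        -H "x-goog-api-key: $GEMINI_API_KEY" \
#        -d "<body>" \
#        https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent
print(flex_payload("Enrich this CRM record in the background."))
```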
The Flex tier will be available on all paid tiers for GenerateContent and Interactions API requests.
Priority Inference: Highest reliability for critical apps
The new Priority Inference tier offers our highest level of assurance at a premium price point. This helps to ensure your most important traffic is not preempted, even during peak platform usage.
- Highest criticality: Priority requests are served at the highest criticality, for greater reliability even during peak load.
- Graceful downgrade: If your traffic exceeds your Priority limits, overflow requests are automatically served at the Standard tier instead of failing. This keeps your application online and helps to ensure business continuity.
- Transparent response: The API response indicates which tier served your request, giving you full visibility into your performance and billing.
- Ideal use cases: Real-time customer support bots, live content moderation pipelines, and time-sensitive requests.
To use Priority Inference, simply set the service_tier parameter accordingly:
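A sketch of a Priority request, plus a check for the graceful-downgrade behavior described above. The `service_tier` request parameter is named in this post, but the response field that reports the serving tier is hypothetical here; the post only says the response indicates which tier served your request:

```python
import json

def priority_payload(prompt: str) -> str:
    """JSON body for a Priority request (field placement is a sketch)."""
    return json.dumps({
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": "priority",
    })

def served_tier(response: dict) -> str:
    """Read back which tier actually served the request.

    "serviceTier" is a hypothetical response-field name for illustration;
    consult the API reference for the real one.
    """
    return response.get("serviceTier", "standard")

# If your traffic exceeds your Priority limits, overflow requests are served
# at Standard instead of failing, so check the tier you were billed for:
print(served_tier({"serviceTier": "priority"}))  # priority
print(served_tier({}))  # standard (graceful downgrade)
```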
Priority inference will be available to users with Tier 2 / 3 paid projects across the GenerateContent API and Interactions API endpoints.
Visit the Gemini API documentation to see the full pricing breakdown and start optimizing your production tiers today. To see it in action, check out the cookbook for runnable code examples.