
New ways to balance cost and reliability in the Gemini API

blog.google · Lucia Loher, Product Manager, Gemini API · April 2, 2026 · 3 min read

Google is introducing two new inference tiers to the Gemini API, Flex and Priority, to balance cost and latency.


Introducing Flex and Priority inference: advanced controls for developers to optimize costs and reliability through a single, unified interface.

Hussein Hassan Harrirou, Engineering, Gemini API


Today, we are adding two new service tiers to the Gemini API: Flex and Priority. These new options give you granular control over cost and reliability through a single, unified interface.

As AI evolves from simple chat into complex, autonomous agents, developers typically have to manage two distinct types of logic:

  • Background tasks: High-volume workflows like data enrichment or "thinking" processes that don't need instant responses.
  • Interactive tasks: User-facing features like chatbots and copilots where high reliability is needed.

Until now, supporting both meant splitting your architecture between standard synchronous serving and the asynchronous Batch API. Flex and Priority help to bridge this gap. You can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints. This eliminates the complexity of async job management while giving you the economic and performance benefits of specialized tiers.
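The routing described above can be sketched as a single helper that tags each request with a tier before it goes through the one synchronous code path. The tier names come from this post; the task categories and helper function are illustrative, not part of the API:

```python
def pick_service_tier(task_kind: str) -> str:
    """Map a workload category to a Gemini API service tier."""
    if task_kind in {"chatbot", "copilot", "moderation"}:
        return "priority"  # interactive, user-facing: highest reliability
    if task_kind in {"enrichment", "research", "agent_background"}:
        return "flex"      # latency-tolerant background work: 50% cheaper
    return "standard"      # everything else stays on the default tier
```

Because both tiers use the same synchronous endpoints, this is the only branch point — there is no separate batch-job submission or polling path for the background work.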

Flex Inference: scale innovation for 50% less

Flex Inference is our new cost-optimized tier, designed for latency-tolerant workloads without the overhead of batch processing.

  • 50% price savings: Pay half the Standard API price in exchange for lower request criticality — Flex requests may be preempted and may see higher latency.
  • Synchronous simplicity: Unlike the Batch API, Flex is a synchronous interface. You use the same familiar endpoints without managing input/output files or polling for job completion.
  • Ideal use cases: Background CRM updates, large-scale research simulations, and agentic workflows where the model "browses" or "thinks" in the background.

Get started fast by simply configuring the service_tier parameter in your request:
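The original snippet is not reproduced here, but assuming `service_tier` is accepted as a top-level field of the REST request body, a Flex request might look like the following (the model name is illustrative and the endpoint shape follows the public v1beta REST API; verify the exact field placement against the documentation):

```python
import json

# Illustrative model; any generateContent-capable model should work.
GEMINI_URL = ("https://generativelanguage.googleapis.com/v1beta/"
              "models/gemini-2.5-flash:generateContent")

def flex_body(prompt: str) -> dict:
    """Build a generateContent request body that opts into Flex inference."""
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "service_tier": "flex",  # cost-optimized: may be slower or preempted
    }

payload = json.dumps(flex_body("Enrich this CRM record in the background."))
# POST `payload` to GEMINI_URL with an `x-goog-api-key` header to send it.
```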

The Flex tier will be available on all paid tiers for GenerateContent and Interactions API requests.

Priority Inference: Highest reliability for critical apps

The new Priority Inference tier offers our highest level of assurance at a premium price point. This helps to ensure your most important traffic is not preempted, even during peak platform usage.

  • Highest criticality: Priority requests are assigned the highest criticality, so they retain their reliability even during peak load.
  • Graceful downgrade: If your traffic exceeds your Priority limits, overflow requests are automatically served at the Standard tier instead of failing. This keeps your application online and helps to ensure business continuity.
  • Transparent response: The API response indicates which tier served your request, giving you full visibility into your performance and billing.
  • Ideal use cases: Real-time customer support bots, live content moderation pipelines, and time-sensitive requests.

To use Priority Inference, simply set the service_tier parameter accordingly:
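Under the same assumption that `service_tier` is a top-level request field, a Priority request only changes the tier value. Because overflow traffic is downgraded to Standard rather than rejected, it is worth reading the serving tier back from the response; the response field name used below is an assumption for illustration:

```python
def priority_body(prompt: str) -> dict:
    """Build a generateContent request body pinned to Priority inference."""
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "service_tier": "priority",  # highest criticality, premium price
    }

def served_tier(response: dict) -> str:
    """Report which tier actually served the request.

    The post says the response indicates the serving tier (Priority
    overflow is downgraded to Standard); the exact field name here
    is an assumption, not a documented response key.
    """
    return response.get("service_tier", "standard")
```

Logging the served tier per request gives you the visibility into performance and billing that the transparent-response bullet above describes.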

Priority inference will be available to Tier 2 and Tier 3 paid projects across the GenerateContent and Interactions API endpoints.

Visit the Gemini API documentation to see the full pricing breakdown and start optimizing your production tiers today. To see it in action, check out the cookbook for runnable code examples.
