Closing the knowledge gap with agent skills

Google Developers Blog, March 31, 2026

To bridge the gap between static model knowledge and rapidly evolving software practices, Google DeepMind developed a "Gemini API developer skill" that gives agents live documentation and SDK guidance. Evaluation results show a large performance gain, with the gemini-3.1-pro-preview model jumping from a 28.2% to a 96.6% success rate when equipped with the skill. This lightweight approach shows how pairing models that have strong reasoning capabilities with access to a "source of truth" can largely eliminate outdated coding patterns.

MARCH 25, 2026

Large language models (LLMs) have fixed knowledge, because each one is trained up to a specific point in time. Software engineering practices, by contrast, are fast paced: new libraries launch every day and best practices evolve quickly.

This leaves a knowledge gap that language models can't solve on their own. At Google DeepMind we see this in a few ways: our models don't know about themselves when they're trained, and they aren't necessarily aware of subtle changes in best practices (like thought circulation) or SDK changes.

Many solutions exist, from web search tools to dedicated MCP services, but more recently, agent skills have surfaced as an extremely lightweight but potentially effective way to close this gap.

While there are strategies that we, as model builders, can implement, we wanted to explore what is possible for any SDK maintainer. Read on for what we did to build the Gemini API developer skill and the results it had on performance.

What we built

To help coding agents building with the Gemini API, we built a skill that:

explains the high-level feature set of the API,
describes the current models and SDKs for each language,
demonstrates basic sample code for each SDK, and
lists the documentation entry points (as sources of truth).

This is a basic set of primitive instructions that guides an agent towards our latest models and SDKs and, importantly, also points to the documentation to encourage retrieving fresh information from the source of truth.

The skill is available on GitHub, or you can install it directly into your project with:

    # Install with Vercel skills
    npx skills add google-gemini/gemini-skills --skill gemini-api-dev --global

    # Install with Context7 skills
    npx ctx7 skills install /google-gemini/gemini-skills gemini-api-dev

Skill tester

We created an evaluation harness of 117 prompts to measure skill performance. Each prompt asks the model to generate Python or TypeScript code that uses the Gemini SDKs.

The prompts span several categories, including agentic coding tasks, building chatbots, document processing, streaming content and a number of specific SDK features.

We ran these tests both in "vanilla" mode (directly prompting the model) and with the skill enabled. To enable the skill, the model is given the same system instruction that the Gemini CLI uses, and two tools: activate_skill and fetch_url (for downloading the docs).
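The skill-enabled setup can be pictured as a small tool dispatcher. The following is a minimal sketch, not the harness's actual code: only the tool names activate_skill and fetch_url come from the description above. The SKILLS table, the handler signatures, the dispatch function and the stubbed fetcher are all hypothetical stand-ins.

```python
from typing import Callable, Dict

# Hypothetical in-memory skill store; a real agent would load skill files from disk.
SKILLS: Dict[str, str] = {
    "gemini-api-dev": "Use the current Gemini SDKs and consult the docs before coding.",
}


def activate_skill(name: str) -> str:
    """Return the instructions for a named skill, or an error message."""
    return SKILLS.get(name, f"unknown skill: {name}")


def make_fetch_url(fetcher: Callable[[str], str]) -> Callable[[str], str]:
    """Build a fetch_url tool around an injected fetcher so it can be stubbed."""
    def fetch_url(url: str) -> str:
        return fetcher(url)
    return fetch_url


def dispatch(tools: Dict[str, Callable[[str], str]], name: str, argument: str) -> str:
    """Route a model-issued tool call to the matching handler."""
    if name not in tools:
        return f"unknown tool: {name}"
    return tools[name](argument)


# The stub fetcher stands in for a real HTTP download of the documentation.
tools = {
    "activate_skill": activate_skill,
    "fetch_url": make_fetch_url(lambda url: f"<contents of {url}>"),
}
```

Injecting the fetcher keeps the sketch runnable offline; in a real run it would be replaced by an HTTP client that downloads the documentation page.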

A prompt is counted as a failure if the generated code uses one of our old SDKs.
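A check of this kind can be approximated with a simple scan for legacy package names. This is a sketch under assumptions: the article does not say how the harness detects old SDKs, and the pattern list below is our own illustration rather than the harness's rule set. It treats the `google.generativeai` Python module and the `@google/generative-ai` npm package as the legacy SDKs.

```python
import re

# Assumed legacy markers (our illustration, not taken from the article).
LEGACY_PATTERNS = [
    # Python: "import google.generativeai as genai"
    re.compile(r"^\s*import\s+google\.generativeai\b", re.MULTILINE),
    # Python: "from google.generativeai import ..."
    re.compile(r"^\s*from\s+google\.generativeai\b", re.MULTILINE),
    # TypeScript/JavaScript: import or require of "@google/generative-ai"
    re.compile(r"""["']@google/generative-ai\b"""),
]


def uses_legacy_sdk(code: str) -> bool:
    """True if the generated code references any assumed legacy SDK marker."""
    return any(pattern.search(code) for pattern in LEGACY_PATTERNS)


def passes(code: str) -> bool:
    """A generated sample passes when it avoids every legacy marker."""
    return not uses_legacy_sdk(code)
```

A text scan like this is cheap but coarse: it does not verify that the code runs or that it answers the prompt, only that it avoids the outdated packages.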

Skills work, but they need reasoning

The top-line results:

The latest Gemini 3 series models achieve excellent results with the addition of the gemini-api-dev skill, notably coming from a low baseline without it (6.8% for both 3.0 Pro and Flash, 28.2% for 3.1 Pro).
The older 2.5 series models also benefit, but nowhere near as much. Using modern models with strong reasoning support makes a difference.
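To make the size of the jump concrete, the reported figures can be turned into percentage-point and relative gains. The sketch below uses only the 28.2% and 96.6% success rates quoted for gemini-3.1-pro-preview and the 117-prompt harness size. The implied pass counts are our own back-calculation, assuming every prompt was scored exactly once; they are not figures published by the authors.

```python
TOTAL_PROMPTS = 117  # harness size stated in the article


def implied_passes(rate_percent: float, total: int = TOTAL_PROMPTS) -> int:
    """Back-calculate the nearest whole number of passing prompts (an inference)."""
    return round(rate_percent / 100 * total)


baseline, with_skill = 28.2, 96.6  # gemini-3.1-pro-preview success rates (%)

gain_points = round(with_skill - baseline, 1)  # percentage-point gain: 68.4
relative = round(with_skill / baseline, 2)     # relative improvement: 3.43x

print(implied_passes(baseline), implied_passes(with_skill))  # 33 113
print(gain_points, relative)                                 # 68.4 3.43
```

Under that assumption, the skill moves the model from roughly 33 to roughly 113 passing prompts out of 117.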

All categories performed well

Adding the skill was effective across almost all domains for the top-performing model (gemini-3.1-pro-preview).

SDK Usage had the lowest pass rate, at 95%. There is no stand-out reason for this; the failed prompts cover a range of tasks that include some difficult or unclear requests, but notably they include prompts that explicitly request Gemini 2.0 models.

Here's an example from the SDK usage category that failed across all models.

When I use the Python api with the gemini 2.0 flash model, and when the output is quite long, the returned content will be an array of output chunks instead of the whole thing. i guess it was doing some kind of streaming type of input. how to turn this off and get the whole output together

Skill issues

These initial results are quite encouraging, but we know from Vercel's work that direct instruction through AGENTS.md can be more effective than using skills, so we are exploring other ways to supply live knowledge of SDKs, such as directly using MCPs for documentation.

Skill simplicity is a huge benefit, but right now there isn't a great skill update story, other than requiring users to update manually. In the long term this could leave old skill information in users' workspaces, doing more harm than good.

Despite these minor issues, we’re still excited to start using skills in our workflows. The Gemini API skill is still fairly new, but we’re keeping it maintained as we push model updates, and we will be exploring different avenues for improving it. Follow Mark and Phil for updates as we tune the skill, and don’t forget to try it out and let us know your feedback!

Original source

Google Developers Blog

https://developers.googleblog.com/closing-the-knowledge-gap-with-agent-skills/
