Multi-Model AI Orchestration for Software Development: How I Ship 10x Faster with Claude, Codex, and Gemini
I shipped 19 tools across 2 npm packages, got them reviewed, fixed 10 bugs, and published, all in one evening. I did not do it by typing faster. I did it by orchestrating multiple AI models the same way I would coordinate a small development team.
That shift changed how I use AI for software work. Instead of asking one model to do everything, I assign roles: one model plans, another researches, another writes code, another reviews, and another handles large-scale analysis when the codebase is too broad for everyone else.
The Problem
Most developers start with a simple pattern: open one chat, paste some code, and keep asking the same model to help with everything. That works for small tasks. It breaks down on real projects.
The first problem is context pressure. As the conversation grows, the model’s context window fills with stale details, exploratory dead ends, copied logs, and half-finished code. Even when the window is technically large enough, quality often degrades because the model is trying to juggle too many concerns at once.
The second problem is that modern codebases are not tidy, single-language systems. The projects I work on often span TypeScript, Python, C#, shell scripts, README docs, test suites, CI config, and package metadata. The mental model required to review a TypeScript AST transform is not the same as the one required to inspect Unity C# editor code or write reliable Python tests.
The third problem is that software development is not one task. It is a bundle of different tasks:
- writing implementation code
- researching project conventions
- reviewing for defects
- running builds and tests
- comparing architectures
- doing large-scale cross-file analysis
- answering quick lookup questions
Using one model for all of that is like asking one engineer to do product design, coding, testing, documentation, DevOps, and code review at the same time.
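To make the role split concrete, here is one way to sketch the routing in TypeScript. The model names reflect my setup; `routeTask` itself is illustrative, not a real API.

```typescript
// Hypothetical router: maps a task category to the model that
// should spend tokens on it. Mirrors the role table in the next section.
type Task =
  | "plan"
  | "research"
  | "implement"
  | "review"
  | "bulk-analysis"
  | "build-test";

function routeTask(task: Task): string {
  switch (task) {
    case "plan":          return "claude-opus";    // orchestrator: decisions
    case "research":      return "claude-sonnet";  // cheap parallel reading
    case "implement":     return "codex";          // sandboxed code writing
    case "review":        return "codex";          // fresh-session review
    case "bulk-analysis": return "gemini-2.5-pro"; // huge-context scans
    case "build-test":    return "claude-sonnet";  // mechanical verification
    default:              throw new Error(`unknown task: ${task}`);
  }
}

console.log(routeTask("review")); // codex
```

The point is not the code; it is that the routing decision happens before any model spends tokens on the task.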
The Architecture: Each Model Has a Role
I now use a multi-model setup where each model has a clear job.
| Model | Role | Why This Model |
| --- | --- | --- |
| Claude Opus (Orchestrator) | Decision-making, planning, user communication, coordination | Strongest reasoning, sees the big picture |
| Claude Sonnet (Subagent) | Codebase research, file reading, build/test, pattern finding | Fast, cheap, parallelizable |
| Codex MCP | Code writing in sandbox, counter-analysis, code review | Independent context, can debate with Opus |
| Gemini 2.5 Pro | Large-scale analysis (10+ files), cross-cutting research | 1M-token context for massive codebases |
This is the important constraint: Opus almost never reads more than three files directly, and it never writes code spanning more than two files.
Opus is my scarce resource. I want its context window reserved for decisions, tradeoffs, and coordination. If I let it spend tokens reading ten implementation files, parsing test fixtures, or editing code across half the repo, I am wasting the most valuable reasoning surface in the system.
So I deliberately make Opus act more like a tech lead than a hands-on individual contributor:
- It decides what needs to be built.
- It asks subagents to gather evidence.
- It synthesizes findings into an implementation spec.
- It asks Codex to challenge that spec.
- It resolves disagreements.
- It sends implementation to the right execution agent.
The Core Principle: Preserve the Orchestrator
The best model should not be your file reader, log parser, or bulk code generator.
If I need to answer questions like these:
- What conventions does this repo use for new tools?
- Which helper utilities are already available?
- How do existing tests structure edge cases?
- Where does platform-specific formatting happen?
I do not spend Opus on that. I send Sonnet agents to inspect the codebase and return structured findings. If the question spans a huge number of files, I use Gemini for the broad scan and have it summarize patterns, architectural seams, and constraints.
Then Opus makes the decision with clean inputs instead of raw noise.
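The fan-out itself is simple. Here is a minimal sketch, assuming a hypothetical `askSonnet` function standing in for however you dispatch a subagent (Claude Code's Agent tool, in my case):

```typescript
// Fan out independent research questions to parallel subagents and
// collect the findings keyed by question. askSonnet is a hypothetical
// stand-in for a real agent-dispatch call, not an actual SDK function.
type AskAgent = (question: string) => Promise<string>;

async function research(
  questions: string[],
  askSonnet: AskAgent
): Promise<Record<string, string>> {
  // Independent questions run concurrently, not one after another.
  const answers = await Promise.all(questions.map(askSonnet));
  const findings: Record<string, string> = {};
  questions.forEach((q, i) => {
    findings[q] = answers[i];
  });
  return findings;
}
```

The structured result is what reaches the orchestrator; the raw file contents never do.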
Real-World Example 1: Building 4 Platform Mappers in One Session
One of the clearest examples was figma-spec-mcp, an open source MCP server that bridges Figma designs to code platforms. The package already had a React mapper, and I wanted to expand it with React Native, Flutter, and SwiftUI support while preserving shared conventions and reusing the normalized UI AST.
Rather than asking one model to build all three mappers in a single pass, I split the work.
Workflow
- A Sonnet subagent researched the codebase: tool conventions, type patterns, existing React mapper design, shared helpers, and how the normalized AST flowed through the system.
- Opus synthesized those findings into a detailed implementation spec.
- I sent a single Codex prompt: create all three new mappers by reusing the normalized UI AST and following the discovered conventions.
- Codex wrote more than 2,000 lines across the new mapper surfaces.
- In a separate Codex review session, I asked it to review the output like a skeptical senior engineer, not like the original author.
- That review found ten platform-specific bugs.
- Three Sonnet subagents fixed those bugs in parallel.
- The full toolset passed TypeScript, ESLint, Prettier, and publint.
What the review caught
The review surfaced bugs that were not obvious from a green-looking implementation:
- Flutter color output used the wrong byte ordering.
- React Native had shadowOffset represented as a string instead of an object.
- SwiftUI output relied on a missing color initializer.
- A few generated platform props matched one framework's conventions but not the actual target platform's API.
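The first bug is worth making concrete: Flutter's `Color` constructor takes a `0xAARRGGBB` integer (alpha first), while CSS-style hex strings put alpha last. A minimal sketch of the conversion, with an illustrative helper name rather than the actual mapper code:

```typescript
// Convert a CSS-style "#RRGGBBAA" (or "#RRGGBB") hex string into
// Flutter's Color(0xAARRGGBB) form, where alpha leads. Illustrative
// helper, not the actual figma-spec-mcp mapper implementation.
function toFlutterColor(hex: string): string {
  const raw = hex.replace("#", "").padEnd(8, "f"); // missing alpha -> ff
  const rrggbb = raw.slice(0, 6);
  const aa = raw.slice(6, 8);
  return `Color(0x${(aa + rrggbb).toUpperCase()})`;
}

console.log(toFlutterColor("#336699"));   // Color(0xFF336699)
console.log(toFlutterColor("#33669980")); // Color(0x80336699)
```

Emitting `0xRRGGBBAA` instead compiles fine and renders wrong, which is exactly why a green build is not enough and a review pass is.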
Result
I ended that session with four platform mappers, reviewed, fixed, lint-clean, and production-ready in about two hours. The speed came from specialization and parallelism, not from asking one model to “be smarter.”
Real-World Example 2: Contributing to CoplayDev/unity-mcp
The second example was a series of open source contributions to CoplayDev/unity-mcp, a Unity MCP server with over 1,000 stars. The most significant was adding an execute_code tool that lets AI agents run arbitrary C# code directly inside the Unity Editor, with in-memory compilation via Roslyn, safety checks, execution history, and replay support.
The interesting part is how the feature gap was identified. I was already using a different Unity MCP server (AnkleBreaker) for my own projects, and I noticed it had capabilities that CoplayDev lacked. Rather than manually comparing 78 tools against 34, I had AI agents do the comparison systematically.
Workflow
- I identified the gap myself by working with both MCP servers daily, then used a Sonnet exploration agent to systematically map all tools from AnkleBreaker's 78-tool set against CoplayDev's 34 tools. The agent returned a structured comparison table showing exactly which features were missing.
- From that gap analysis, I picked execute_code as the highest-impact contribution: it unlocks an entire class of workflows where AI agents can inspect live Unity state, run editor automation, and validate assumptions without requiring manual steps.
- A Sonnet agent deep-dived CoplayDev's dual-codebase conventions (Python MCP server + C# Unity plugin), studying the tool registration pattern, parameter handling, response envelope format, and test structure.
- Opus synthesized the research into a detailed implementation spec covering four actions (execute, get_history, replay, clear_history), safety checks for dangerous patterns, Roslyn/CSharpCodeProvider fallback, and execution history management.
- Codex wrote the full implementation: ExecuteCode.cs (C# Unity handler with in-memory compilation), execute_code.py (Python MCP tool), and test_execute_code.py (unit tests). Over 1,600 lines of additions.
- Opus reviewed the output and caught issues before the PR went out.
- The PR was merged after reviewer feedback was addressed.
What the review caught
- Safety check patterns needed tightening for edge cases around System.IO and Process usage
- Error line number normalization had to account for the wrapper class offset
- Compiler selection logic needed a cleaner fallback path
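To illustrate the shape of the safety checks involved: scan the submitted snippet for known-dangerous API patterns before compiling it. This is a hedged TypeScript sketch with example patterns, not the actual C# rule set in ExecuteCode.cs:

```typescript
// Illustrative dangerous-pattern scan for user-supplied code snippets.
// The real execute_code tool does this in C# inside the Unity Editor;
// the pattern list here is an example, not the project's actual set.
const DANGEROUS_PATTERNS: RegExp[] = [
  /\bSystem\.IO\.(File|Directory)\.Delete\b/, // destructive file operations
  /\bSystem\.Diagnostics\.Process\b/,         // spawning external processes
  /\bAssetDatabase\.DeleteAsset\b/,           // destroying project assets
];

function findDangerousPatterns(code: string): string[] {
  return DANGEROUS_PATTERNS.filter((re) => re.test(code)).map(
    (re) => re.source
  );
}

const hits = findDangerousPatterns(
  'System.IO.File.Delete(path); Debug.Log("done");'
);
console.log(hits.length); // 1
```

The review feedback above was precisely about tightening these kinds of patterns: naive substring checks miss aliased namespaces and fully qualified edge cases.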
Result
The execute_code tool became one of the more significant contributions to the project, enabling AI agents to do things like inspect scene hierarchies at runtime, validate component references programmatically, and run editor automation scripts. The contribution was grounded in a real gap analysis rather than guesswork, and the multi-model workflow ensured the implementation matched the project’s conventions across two languages.
Real-World Example 3: roblox-shipcheck Shooter Audit Expansion
The third example was roblox-shipcheck, an open source Roblox game audit tool. I wanted to add six shooter-genre-specific tools and expand the package around them with tests, documentation, examples, and release notes.
Workflow
- Background Sonnet agents worked in parallel on the README rewrite, CHANGELOG, usage examples, and unit tests.
- Codex wrote all six shooter tools: weapon config audit, hitbox audit, scope UI audit, mobile HUD audit, team infrastructure audit, and anti-cheat surface audit.
- In a separate review session, Codex reviewed the generated implementation and found eight issues.
- A Sonnet agent fixed those issues and got 124 tests passing.
- Sourcery AI, acting as an automated reviewer, found three additional issues.
- Another Sonnet agent addressed the review feedback and tightened the remaining edge cases.
What the review caught
The first review wave found:
- ESLint violations
- heuristics that were too strict for real-world projects
- false positives for free-for-all game modes
The automated reviewer then found:
- opportunities to consolidate shared test helpers
- missing edge cases in the audit suite
- rough spots in the implementation details around reuse and consistency
Result
The package ended with 49 tools total, 124 passing tests, a cleaner README, updated examples, release notes, and green CI across TypeScript, ESLint, Prettier, and SonarCloud. That is the difference between “I added some code” and “I shipped a maintainable release.”
Token Budget Rules: The Key Insight
The most important lesson in all of this is simple: your orchestrator’s context window is the scarcest resource in the system.
These are the rules I follow now:
- Opus reads three files or fewer per task. If I need more than that, I delegate the reading to Sonnet or Gemini and ask for a structured summary.
- Opus writes code in two files or fewer. If the task spans more than two files, I send it to Codex with a detailed spec.
- Before starting any task, I ask: "Can a subagent do this?" If the answer is yes, I stop and delegate.
- Codex reviews everything. Even code Codex wrote itself. The review happens in a separate session so it can challenge its own assumptions.
- Independent work gets parallel agents. If docs, tests, examples, and changelog updates do not depend on each other, they should run at the same time.
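The first two rules reduce to a simple gate. The thresholds are the real ones I use; the function and its field names are illustrative:

```typescript
// Decide whether the orchestrator may handle a task directly or
// must delegate. Thresholds match my budget rules: Opus reads at
// most 3 files and writes at most 2. The interface is hypothetical.
interface TaskPlan {
  filesToRead: number;
  filesToWrite: number;
}

function assignExecutor(plan: TaskPlan): "opus" | "sonnet" | "codex" {
  if (plan.filesToWrite > 2) return "codex"; // multi-file implementation
  if (plan.filesToRead > 3) return "sonnet"; // bulk reading is delegated
  return "opus"; // small enough to keep on the orchestrator
}

console.log(assignExecutor({ filesToRead: 10, filesToWrite: 0 })); // sonnet
```

Checking this gate before every task is what keeps the orchestrator's window clean over a long session.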
Here is the mental model I use:
```
Opus   = scarce strategic bandwidth
Sonnet = cheap parallel investigation
Codex  = isolated implementation and review
Gemini = massive-context research pass
```
Once I started treating context like a budget instead of an infinite buffer, my sessions became dramatically more reliable.
The Debate Pattern
One of the most effective techniques in this setup is what I call the debate pattern.
Instead of asking one model for a solution and immediately implementing it, I force a disagreement phase.
The process
- Opus analyzes the problem and proposes a solution.
- Codex receives that analysis and produces counter-analysis: where it agrees, where it disagrees, and what it would change.
- If there are conflicts, I do one follow-up round to resolve them.
- Once there is consensus, I convert that into an implementation plan.
- Codex implements.
- A separate Codex session reviews the result.
This works because disagreement exposes hidden assumptions.
In one session, that debate caught:
- Flutter Color formatting confusion between 0xRRGGBBAA and 0xAARRGGBB
- React Native Paper prop mismatch using mode where variant was correct
- a non-existent SwiftUI Color(hex:) initializer
None of those issues were broad architectural failures. They were the kind of platform-specific correctness bugs that burn time after merge if you do not catch them early.
The debate pattern turns AI assistance from “fast autocomplete” into “adversarial design review plus implementation.”
Results
The performance difference is large enough that I now think in terms of orchestration by default.
| Metric | Single Model | Multi-Model Orchestration |
| --- | --- | --- |
| Tools shipped per session | 2-3 | 10-15 |
| Bugs caught before publish | ~60% | ~95% (Codex review) |
| Parallel workstreams | 1 | 6+ simultaneous |
| Context preservation | Degrades after 3-4 files | Stays sharp (delegated) |
| Convention compliance | Often drifts | Exact match (research first) |
Getting Started
If you want to try this workflow, start simple. You do not need a huge automation stack on day one. You just need role separation and a few clear rules.
My practical setup
- Claude Code CLI with Opus as orchestrator for planning, decisions, and user-facing coordination
- Codex MCP server (npm: codex) for implementation, sandboxed code changes, and review
- Gemini MCP (npm: gemini-mcp-tool) for large-scale repo analysis and broad research across many files
- Sonnet subagents via Claude Code's Agent tool for codebase research, builds, tests, pattern extraction, docs, and support work
The most important operational detail is to write your rules down in CLAUDE.md. If the orchestrator has to rediscover your preferences every session, you lose consistency and waste tokens.
My CLAUDE.md contains rules like:
```
- Opus reads <= 3 files directly
- Opus writes <= 2 files directly
- Delegate codebase exploration to Sonnet
- Use Codex for implementation spanning multiple files
- Always run a separate review pass before publish
- Prefer parallel subagents for independent tasks
```
That single file turns ad hoc prompting into a repeatable operating model.
A good first workflow
If you want a low-friction way to start, try this:
- Use Sonnet to inspect the repo and summarize conventions.
- Use Opus to write a short implementation spec.
- Use Codex to implement across the affected files.
- Use a fresh Codex session to review for defects.
- Use Sonnet to fix issues and run tests.
Practical Lessons
Three habits made the biggest difference for me.
First, I stopped treating AI output as a finished artifact and started treating it as a managed workstream. Every meaningful code change has research, implementation, review, and verification phases. Different models are better at different phases.
Second, I learned that independent context is a feature, not a limitation. When Codex reviews code from a separate session, it does not inherit all the assumptions of the implementation pass. That distance is exactly why it catches bugs.
Third, I stopped optimizing for “best prompt” and started optimizing for “best routing.” The better question is: which model should spend tokens on this specific task?
Conclusion
The future of AI-assisted development is not a single omniscient model sitting in one giant chat. It is orchestration: using the right model for the right task, preserving your strongest model’s context for decisions, and letting specialized agents handle research, implementation, review, and verification.
If you are already using AI in development, my practical advice is simple: stop asking one model to do everything. Give each model a role, protect your orchestrator’s context window, and add a real review pass. That is where the 10x improvement comes from.