End of March - what’s up with Coding Agents?
One of the most interesting things in THE WORLD is the rapid advancement of models / coding agents / whatever these things are. Models in a coding harness? Well, there are models, and then harnesses. I guess this is about the models.
As you likely know, if you’re coding with AI this month you’re either on a subscription plan to Claude Code or OpenAI’s Codex (side note - interesting that it’s not ChatGPT Codex; apparently the MASSIVE name recognition for ChatGPT isn’t working for the marketing geniuses at OpenAI), or you’re paying by the token via API.
If you are a hobbyist or solo practitioner like me, you’re subscribing to the incredible loss-leader deals that Claude and Codex offer at $200 a month. I have both, I never run out of tokens, and by my count I’d be spending thousands of dollars via the API.
Okay, whatever. Here’s the gist: I had my agenty-team-people review what’s available today. Why? I have several OpenClaw installs with many agents. My token usage is ramping up, and I’m quite curious about where the future lies when these incredible subscriptions go away.
The following was written by 3 agenty-team-people.
Orchestration: Ops Agent: Opus 4.6
Research: Analytics-Agent: Model: Sonnet 4.6
(Note: Analytics was supposed to run on Kimi K2.5 (its default model) but I had to override to Sonnet after Kimi spent 22 minutes sending garbage web search queries. Kimi can’t do tool calling reliably — it’s good at reasoning over data that’s already in context, not at gathering it apparently!)
Writing: Marketing Agent: Model: Sonnet 4.6
The top five coding models are within one percentage point of each other. This seems kind of remarkable but I don’t know what this really means - are they good at the same things? Some are better at some things than others? No idea.
Claude Opus 4.5 leads at 80.9% on what seems to be the testing standard, “SWE-bench Verified”. Fourth place is MiniMax M2.5 at 80.2%. MiniMax is a model from a company in China priced at $0.30/M input tokens instead of Claude’s $5. I note that it’s from China not because the model works in Chinese but because there is a fascination with how great models are coming from China, which is confusing because they don’t have access to our cutting-edge hardware from Nvidia. And because maybe they are spying on you.
If you’re agonizing over which model to pick, you’re optimizing the wrong thing.
I’ve been using AI coding tools daily since Copilot launched. I run three companies, ship code across multiple stacks, and have strong opinions about what actually matters. Here’s what I learned researching every major coding model in March 2026.
The Three Metrics That Actually Matter
Efficacy isn’t just benchmarks. SWE-bench Verified has a contamination problem — Claude Opus 4.5 scores 80.9% on the verified set but drops to 45.9% on SWE-bench Pro (the uncontaminated version). That’s a 35-point gap. Models trained on test data.
Look at LiveCodeBench instead. It pulls fresh problems from LeetCode, AtCoder, and CodeForces. Gemini 3 Pro crushes it at 91.7%. DeepSeek V3.2 hits 89.6%. These are clean numbers.
Cost isn’t $/million tokens. That’s an abstraction. Real cost is what you pay per 10,000 coding conversations, and that depends entirely on how many input and output tokens your typical request burns. Here’s a rough comparison at list prices:
| Model | Input $/M | Output $/M | Cost per 10K requests | SWE-bench |
|---|---|---|---|---|
| Claude Opus 4.5 | $5.00 | $25.00 | ~$37 | 80.9% |
| Gemini 3.1 Pro | $2.50 | $15.00 | ~$22 | 80.6% |
| GPT-5.2 | $1.75 | $14.00 | ~$20 | 80.0% |
| Kimi K2.5 | $0.45 | $2.20 | ~$4 | 76.8% |
| DeepSeek V3.2 | $0.27 | $1.00 | ~$2 | 73.1% |
Kimi gives you 95% of Claude’s performance at 11% of the cost. DeepSeek gives you 90% at 5% of the cost.
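The arithmetic behind that table is worth internalizing, because the token footprint of a request drives everything. A quick sketch of the math (the token counts in the examples are my own illustrative assumptions, not measured figures; plug in your workload’s real numbers):

```python
def cost_per_requests(input_price_per_m, output_price_per_m,
                      input_tokens, output_tokens, n_requests=10_000):
    """Total spend in USD for n_requests, given $/M-token prices
    and the token footprint of a single request."""
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return per_request * n_requests

# Short autocomplete-style requests (~100 in / ~150 out tokens)
# at Claude Opus list pricing ($5/M in, $25/M out).
light = cost_per_requests(5.00, 25.00, 100, 150)

# Heavy refactoring requests (~2,000 in / ~3,000 out) at the same prices.
heavy = cost_per_requests(5.00, 25.00, 2_000, 3_000)
```

The spread between those two calls is the whole point: the same model gets an order of magnitude more expensive per 10K requests once you feed it real refactoring context.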
Buzz is what developers actually use. Not what benchmarks say they should use.
According to Pragmatic Engineer’s survey, “Anthropic has become the go-to model developer for coding-related work — Opus 4.5 and Sonnet 4.5 come up more often than all other models, combined.”
Claude Sonnet 4.5 is #6 on SWE-bench Verified but #1 in developer hearts. Why? Trust. Ecosystem. Error handling. Edge cases. The stuff benchmarks don’t measure.
The Leaderboard (Reality Edition)
Tier 1: Flagship — Pay for Peace of Mind
Best in Tier: GPT-5.2 ($1.75/M input, 80.0% SWE-bench)
| Model | Cost/10K | SWE-bench | Why Use It |
|---|---|---|---|
| Claude Opus 4.5 | $37 | 80.9% | Developer trust, best error recovery |
| Claude Opus 4.6 | $37 | 80.8% | Latest, 1M context at standard pricing |
| Gemini 3.1 Pro | $22 | 80.6% | Strong on fresh benchmarks (91.7% LiveCodeBench) |
| GPT-5.2 | $20 | 80.0% | Best value in tier, 400K context |
When to pay up: Mission-critical refactoring. Large enterprise codebases. When you can’t afford to debug bad suggestions.
The honest take: These four are statistically tied. Pick based on ecosystem. If you’re using Cursor or Windsurf, you can switch between all of them. Don’t overthink it.
Tier 2: Sweet Spot — 90% Quality, 30% Cost
Best in Tier: Kimi K2.5 ($0.45/M input, 76.8% SWE-bench)
| Model | Cost/10K | SWE-bench | Why Use It |
|---|---|---|---|
| Kimi K2.5 | $4 | 76.8% | Best value in market, native multimodal |
| MiniMax M2.5 | $2 | 80.2% | Dark horse, #4 overall, shockingly good |
| Claude Sonnet 4.6 | $25 | 79.6% | Developer favorite, most trusted |
| Gemini 3 Flash | $5 | 78.0% | Flash pricing, flagship performance |
| Qwen3.5-397B | $6 | 76.4% | Open source, strong agentic capabilities |
This is where I live. Kimi K2.5 at $4 per 10K requests vs Claude Opus at $37 is a no-brainer for 90% of my work. I use Opus when I’m refactoring something critical. Everything else? Kimi or MiniMax.
MiniMax M2.5 is the surprise here. #4 on SWE-bench Verified, beating every OpenAI model except GPT-5.2. Chinese model, barely any developer buzz, but the numbers don’t lie. At $2 per 10K requests, it’s absurdly good value.
Tier 3: Budget Champions — Pennies, Not Dollars
Best in Tier: DeepSeek V3.2 ($0.27/M input, 73.1% SWE-bench)
| Model | Cost/10K | SWE-bench | Why Use It |
|---|---|---|---|
| DeepSeek V3.2 | $2 | 73.1% | ~95% cheaper than Claude, 89.6% LiveCodeBench |
| MiMo-V2-Flash | $1 | 73.4% | Cheapest option that doesn’t suck |
| Gemini 2.5 Flash | $4 | 60.4% | Free tier available, Google ecosystem |
DeepSeek is the open source champion. 73.1% on SWE-bench at $0.27/M input. That’s roughly 95% cheaper than Claude Opus. Twenty-seven cents versus five dollars per million input tokens.
Real-world performance is mixed. Some developers love it, others report it’s hit-or-miss on complex tasks. But at this price, you can afford to regenerate a few times.
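That “regenerate a few times” strategy can be made systematic: hammer the cheap model with retries, and only escalate to a flagship when nothing validates. A minimal sketch (the model callables and validator here are stand-ins for illustration, not any real API):

```python
def cheap_first(task, cheap_model, expensive_model, validate, max_cheap_tries=3):
    """Try the budget model a few times; escalate to the flagship
    only if none of the cheap attempts pass validation."""
    for _ in range(max_cheap_tries):
        result = cheap_model(task)
        if validate(result):
            return result, "cheap"
    return expensive_model(task), "expensive"

# Stand-in models: the cheap one fails twice, then gets it right.
attempts = iter(["SyntaxError", "SyntaxError",
                 "def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"])
cheap = lambda task: next(attempts)
expensive = lambda task: "flagship answer"
looks_valid = lambda out: "Error" not in out

result, tier = cheap_first("write fib", cheap, expensive, looks_valid)
```

Even at three cheap attempts per task, you come out far ahead of routing everything to the flagship by default.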
The Developer Favorites (Buzz Rankings)
Reddit, Hacker News, and Twitter paint a different picture than benchmarks.
#1: Claude Sonnet 4.5 — “Generally considered the best coding model all around” (r/ChatGPTCoding)
Sonnet 4.5 is #6 on SWE-bench Verified. But every developer thread defaults to “just use Sonnet.” Why? It nails the edge cases. Error messages are clear. It doesn’t hallucinate package names. When it refactors, it maintains style consistency.
Benchmarks measure average case. Developers care about worst case. Claude wins on trust.
#2: Kimi K2.5 — “Costs almost 10% of what Opus costs at similar performance” (r/LocalLLaMA)
The value pick. 76.8% on SWE-bench, $0.45/M input. Native multimodal (handles images and text). One-trillion-parameter model that nobody’s heard of, punching way above its weight.
I’ve been using Kimi K2.5 for three weeks. It’s shockingly good. Not quite Claude on complex refactoring, but 90% of the way there at 11% of the cost.
#3: MiniMax M2.5 — “Is this the best coding model in the world?” (r/LocalLLaMA)
Another Chinese model. #4 on SWE-bench Verified (80.2%), beating GPT-5.2. At $0.30/M input, it’s in the same price tier as DeepSeek but performs 7 percentage points higher.
No buzz outside Reddit’s LocalLLaMA community, but the numbers are real. If you’re cost-sensitive and willing to try something off the beaten path, MiniMax delivers.
The Benchmark Reality Check
SWE-bench Verified is contaminated. The gap between Verified and Pro scores proves it:
| Model | Verified Score | Pro Score | Gap |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | 45.9% | -35 points |
| GPT-5 | 74.9% | 23.3% | -51 points |
| Claude Opus 4.1 | 74.5% | 23.1% | -51 points |
These models were trained on problems that overlap with SWE-bench Verified. When you test them on fresh, uncontaminated problems (Pro), performance craters.
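The contamination gap is trivial to compute for any model with both scores. A quick sketch using the figures from the table above:

```python
scores = {  # (SWE-bench Verified, SWE-bench Pro), in percent, from the table above
    "Claude Opus 4.5": (80.9, 45.9),
    "GPT-5": (74.9, 23.3),
    "Claude Opus 4.1": (74.5, 23.1),
}

gaps = {model: round(verified - pro, 1)
        for model, (verified, pro) in scores.items()}

# The bigger the drop, the more the Verified score smells like training-set leakage.
most_suspect = max(gaps, key=gaps.get)
```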
So what’s actually reliable?
LiveCodeBench pulls fresh problems from competitive programming sites. It’s harder to game:
| Model | LiveCodeBench | Developer |
|---|---|---|
| Gemini 3 Pro | 91.7% | Google |
| Gemini 3 Flash (Reasoning) | 90.8% | Google |
| DeepSeek V3.2 | 89.6% | DeepSeek |
| Claude Opus 4.x | 85.0% | Anthropic |
| Qwen 3 235B | 74.1% | Alibaba |
Gemini is crushing it on clean data. But where’s the developer buzz? Nowhere. Reddit barely mentions it. Cursor and Windsurf default to Claude and GPT-5.
Why the disconnect?
Gemini’s API has rough edges. Long-context pricing jumps sharply above 200K tokens. Error messages are cryptic. The ecosystem hasn’t caught up.
Claude and OpenAI win on polish, not raw capability.
HumanEval is saturated. Claude Opus scores 99%. Everyone else is 88-90%. It’s no longer a differentiator. Stop citing HumanEval in 2026.
The Open Source Surprise
2024: “Open source will never catch up to proprietary models.”
2026: MiniMax M2.5 is #4 on SWE-bench Verified. DeepSeek V3.2 is roughly 95% cheaper than Claude at 90% of the performance.
Self-hosting is now viable for serious work.
Hacker News, March 2026: “Kimi K2 on a pair of Mac Studios — 24 tokens/second, negligible power cost vs developer salary.”
Llama 4 Scout has a 10 million token context window. That’s an entire codebase in one shot. Open source. Free to run locally.
Qwen3-Coder 480B hits 69.6% on SWE-bench Verified. MIT/Apache 2.0 license. Deploy it anywhere.
GLM-4.7 (358B parameters, MIT license) scores 73.8% on SWE-bench. You can run this on a cluster of consumer GPUs.
The cost hedge is real. Current API pricing is subsidized VC money. Hacker News sentiment: “Their pricing models are simply not sustainable. Like cab hailing, shopping, social media ads… prices will start going up with nowhere to run.”
Build multi-model workflows now. Use Claude for critical refactoring, Kimi for everything else, DeepSeek for batch jobs, Llama 4 for local experimentation. Don’t lock yourself into one vendor.
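A multi-model workflow doesn’t need fancy infrastructure; at its core it’s a routing table. A minimal sketch (the task categories and model identifiers are my own illustration, not real API model names):

```python
ROUTES = {
    "critical_refactor": "claude-opus",    # pay for peace of mind
    "everyday_coding":   "kimi-k2.5",      # ~90% of the quality at ~10% of the cost
    "batch_job":         "deepseek-v3.2",  # pennies per request
    "experiment":        "llama-4-local",  # free, runs on your own hardware
}

def pick_model(task_kind, default="kimi-k2.5"):
    """Route each job to the cheapest model that's good enough for it."""
    return ROUTES.get(task_kind, default)
```

`pick_model("batch_job")` returns `"deepseek-v3.2"`; anything unrecognized falls back to the daily driver. The point is that the routing decision lives in one place, so swapping vendors is a one-line change instead of a migration.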
The Tools That Matter
Models are commodities. Tools are differentiators.
IDEs: Cursor vs Windsurf vs Copilot
| Tool | Default Models | Pricing | Why Use It |
|---|---|---|---|
| Cursor | GPT-5.x, Claude Sonnet 4.5 | $20/mo | #1 choice, best UX, BYOM support |
| Windsurf | Claude Sonnet 4.5, SWE-1.5 | $15/mo | 80% of Cursor at 75% price, Flow-state awareness |
| Copilot | GPT-4o, GPT-5.x | $10/mo | Tightest GitHub integration, lowest team cost |
I use Cursor. Everyone I know uses Cursor. It’s the default now.
Windsurf is interesting. Flow-state awareness (it tracks what you’re focused on) and Arena Mode (compare models side-by-side in real-time) are killer features. At $15/mo, it’s the value pick.
Copilot is for teams locked into GitHub/Microsoft. $39/seat/month for business. Lowest cost per developer at scale.
All three support BYOM (bring your own model). Cursor and Windsurf let you plug in any OpenAI-compatible API. Use Kimi, MiniMax, or DeepSeek if you want.
CLI Agents: Claude Code vs Codex vs Aider
| Tool | Default Model | Pricing | Why Use It |
|---|---|---|---|
| Claude Code | Claude Sonnet 4.5 | API costs only | Most trusted for complex refactoring |
| OpenAI Codex | GPT-5.2 Codex | API costs only | Best for automated PR workflows |
| Aider | Claude, GPT-5.x | Free + API | Git-native, beloved by power users |
Claude Code is the official Anthropic CLI. I use it for refactoring sessions. It’s the most reliable for multi-step edits across dozens of files.
Codex (OpenAI’s CLI agent) excels at automated workflows. Spawn it in CI, give it a task, it opens a PR. GPT-5.2 Codex scored 64% on Terminal-Bench 2.0 (agentic terminal tasks). GPT-5.3 Codex hit 77.3%.
Aider is for git power users. It speaks in commits and diffs. Free, open source, supports every model via API.
The Agent Shift
We’re moving from autocomplete → autonomous agents.
Cursor Agent Mode, Windsurf Cascade, Codex auto-PR — these tools take a goal, break it into steps, execute, debug, and iterate until done.
SWE-bench measures this capability. So does Terminal-Bench 2.0. Models are getting better at multi-step reasoning, but they’re still fragile. You need to check their work.
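The goal-to-steps-to-execute-to-check loop behind these agent modes is conceptually simple. A stripped-down sketch with stubbed planner/executor/checker functions (everything here is illustrative, not any tool’s real API):

```python
def run_agent(goal, plan, execute, check, max_iterations=5):
    """The basic agent loop: plan the goal into steps, run them,
    verify the result, and fold any failure back into the next pass."""
    for _ in range(max_iterations):
        steps = plan(goal)
        results = [execute(step) for step in steps]
        ok, feedback = check(goal, results)
        if ok:
            return results
        goal = f"{goal} (previous attempt failed: {feedback})"
    raise RuntimeError("agent gave up after max_iterations")

# Stubs standing in for model calls.
plan = lambda goal: ["edit the code", "run the tests"]
execute = lambda step: f"done: {step}"
check = lambda goal, results: (all(r.startswith("done") for r in results), "")

results = run_agent("add caching to the fetch layer", plan, execute, check)
```

Every real agent product adds sandboxing, tool calls, and context management on top, but the fragility lives in this loop: one bad `check` and the agent happily declares victory on broken code.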
Dario Amodei (Anthropic CEO, March 2026): “Within six months 90% of all code will be written by AI.”
Maybe. But 100% of code will still be reviewed by humans. For now.
The Ship Velocity War
OpenAI shipped three major GPT-5 releases in three months:
- GPT-5.2: December 2025
- GPT-5.3: March 2026
- GPT-5.4: March 2026 (two days later)
Developer reaction: Fatigue.
Reddit, March 2026: “Three months between 5.2 and 5.3, then two days to 5.4? What are we even testing here?”
Contrast this with Anthropic. Two releases in the same period:
- Claude Opus 4.5: November 2025
- Claude Sonnet/Opus 4.6: February 2026
Both releases dominated SWE-bench. Both earned developer trust. Slow and steady wins.
Rapid releases feel like thrashing, not progress. Developers want stability. We build workflows around models. Switching costs are high. Ship less, ship better.
The New Hotness (March 2026)
The leaderboard is moving fast. Here’s what just landed or is generating serious heat this month:
GPT-5.4 — OpenAI’s “We’re Back” Moment
Dropped March 5th — two days after GPT-5.3. Native computer use, 1M context in Codex, 57.7% on SWE-bench Pro (the uncontaminated benchmark). That Pro score is the real story — it’s the highest any OpenAI model has hit on clean data.
The vibe from Every.to: “I’m reaching for GPT-5.4 more than Codex 5.3 — not because it’s dramatically more intelligent on raw coding quality, but because it’s much better to work with moment to moment.”
GPT-5.4 Mini and Nano also just shipped — near-flagship performance at a fraction of the cost. This is where the price war gets real.
MiMo-V2-Pro (a.k.a. “Hunter Alpha”) — Xiaomi’s Stealth Bomb
This one’s wild. Xiaomi dropped a trillion-parameter model that was quietly running on OpenRouter under the codename “Hunter Alpha” — racking up 1 trillion tokens before anyone knew who made it.
The numbers: 75.7 on Claw-Eval (3rd globally, behind only Claude Opus 4.6), beats Claude Sonnet 4.6 at coding, approaching Opus-level on agent tasks — at 67% lower cost. VentureBeat called it “stunning.”
A phone company is making one of the best coding models in the world. We’re in that timeline now.
Grok 4.20 — xAI’s Multi-Agent Architecture
Elon’s team went a completely different direction: four AI agents running in parallel. 75% on SWE-bench, 79.6% on Aider Polyglot. The architecture is genuinely novel — not just a bigger transformer.
Early developer reaction is positive but cautious. The multi-agent approach means it handles complex, multi-file tasks differently than single-model approaches.
Seed 2.0 Pro — ByteDance Enters the Arena
ByteDance’s play. 76.5% SWE-bench Verified, AIME 98.3, Codeforces rating of 3020. Released February 14th, strategically timed right before China’s Spring Festival.
Three variants: Pro (flagship intelligence), Lite (balanced), Mini (speed/cost). The Pro tier is legitimately competing with Claude and GPT-5.x on coding. ByteDance has the infrastructure to subsidize pricing aggressively.
Nemotron 3 Super — Nvidia’s Efficiency Play
Nvidia isn’t competing on raw benchmark scores — Nemotron 3 Super sits at #26 on LiveCodeBench. But it’s doing something different: 2.2x higher throughput than comparable open models, 1M context window via hybrid Mamba-Transformer architecture, and it runs on Nvidia’s own hardware stack.
The real play: if you’re running your own inference cluster on Nvidia GPUs, this is built for you. Greptile’s review: “Punches far above its weight class.”
Step-3.5-Flash — The Price Destroyer
StepFun’s model. 74.4% SWE-bench Verified at $0.10/M input, $0.40/M output. That’s $0.63 per 10K coding requests. Less than a dollar for ten thousand conversations.
196B parameters. Nobody talks about it. The numbers are insane for the price.
GLM-5 — Zhipu AI’s Quiet Climb
78.0% SWE-bench Verified. 744B parameters. $1.00/M input, $3.20/M output. Zhipu AI has been steadily climbing the leaderboard without any of the hype. GLM-5 is now #8 overall, ahead of Kimi K2.5.
If the Chinese model ecosystem keeps shipping at this pace, the pricing floor for top-tier coding models is going to zero.
What I Actually Run
Full transparency — here’s my real stack. I run OpenClaw, an AI agent orchestration platform, with 8 functional agents across multiple businesses. This isn’t theoretical.
My Daily Models
| Model | What I Use It For | Monthly Spend |
|---|---|---|
| Claude Opus 4.6 | Primary interface (COO agent), complex orchestration | Heavy — it’s my main brain |
| Claude Sonnet 4.5 | Marketing, Design, Support, Ops agents | Moderate — workhorse for tool-heavy tasks |
| GPT-5.2 Codex | Engineering agent — pure code generation | On-demand — spawned for coding tasks |
| Kimi K2.5 | Finance, Analytics agents — data analysis | Light — great for reasoning, terrible at tool calling |
| Qwen 3.5 122B | Local inference on Mac Studio | Free (electricity only) |
| Qwen 2.5 14B/7B | Edge inference on home server | Free |
Lessons from Actually Running Multi-Model
Kimi K2.5 can’t do tool calling. It spent 22 minutes sending garbage web search queries when I asked it to research this very blog post. Great at reasoning over data already in context. Terrible at gathering data via tools. I override to Sonnet for any agentic work.
Codex is a beast in a harness, mediocre in conversation. GPT-5.2 Codex shines when you give it a codebase, a task, and let it run. Don’t try to have a dialogue with it — that’s not what it’s built for.
Local models are the backup plan. I run Qwen 3.5 122B (MoE, only 10B active parameters) on a Mac Studio. 24 tokens/second, zero marginal cost. When APIs go down or I’m on a plane, local keeps working.
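The backup-plan logic is just a fallback chain: try each backend in priority order and move on when one fails. A sketch (the backend callables are stand-ins; real code would wrap actual API clients and catch specific timeout/HTTP errors):

```python
def with_fallback(task, backends):
    """Try each backend in priority order (hosted API first, local last);
    return the first one that answers."""
    errors = []
    for name, call in backends:
        try:
            return name, call(task)
        except Exception as exc:  # real code: catch specific network errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))

# Stand-ins: the hosted API is down, the local model answers.
def hosted_api(task):
    raise TimeoutError("connection timed out")

def local_qwen(task):
    return "local answer"

backend_used, answer = with_fallback(
    "explain this diff",
    [("claude-api", hosted_api), ("qwen-local", local_qwen)],
)
```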
The subscription plans are insane value. Claude Max and Codex Pro at $200/month each — I’d be spending thousands via API for the same usage. If you’re a solo practitioner, subscribe. The loss-leader pricing won’t last forever.
Multi-model is the move. Don’t pick one model. Use Opus for critical work, Sonnet for tool-heavy tasks, Kimi for cheap analysis, Codex for code generation, local Qwen for experiments. Build workflows that route to the right model for the job. That’s what my agent team does.
The Verdict
If You’re Paying for APIs
Daily driver: Kimi K2.5 ($4 per 10K requests, 76.8% SWE-bench)
Critical tasks: Claude Opus 4.5 ($37 per 10K, 80.9% SWE-bench)
Budget option: DeepSeek V3.2 ($2 per 10K, 73.1% SWE-bench)
Build multi-model workflows. Use Cursor or Windsurf with BYOM support. Don’t lock yourself into one vendor.
If You’re Self-Hosting
Best context window: Llama 4 Scout (10M tokens, open source)
Best performance: Qwen3-Coder 480B (69.6% SWE-bench, MIT license)
Best value: GLM-4.7 (73.8% SWE-bench, fits on consumer hardware)
Self-hosting protects you from future price increases and API outages. Worth the infrastructure cost if you’re running high-volume workloads.
If You’re Choosing an IDE
Best overall: Cursor ($20/mo, BYOM support)
Best value: Windsurf ($15/mo, Arena Mode)
Best for teams: Copilot ($10/mo, $39 business)
All three are good. Cursor has the best UX. Windsurf has the best features-to-price ratio. Copilot has the lowest per-seat cost at scale.
The Real Lesson
Model choice matters less than workflow, tooling, and prompt engineering.
The top five models are within 1% of each other. Kimi K2.5 costs 11% of Claude Opus and delivers 95% of the performance. Open source models are good enough for serious work.
Stop agonizing. Pick a tier, pick a model, build systems around it, ship code.
The AI coding wars are over. Everyone won. Now go build something.
Benchmarks: SWE-bench Verified, LiveCodeBench, SWE-bench Pro. Developer sentiment: Pragmatic Engineer, r/LocalLLaMA, Hacker News. Tool comparisons: TLDL, BuildMVPFast.
Thanks for reading. If this sparked an idea, send it to someone building cool things.