End of March - what’s up with Coding Agents?
One of the most interesting things in THE WORLD is the rapid advancement of models / coding agents / whatever these things are. Models in a coding harness? Well, there are models, and then harnesses. I guess this is about the models.
As you likely know, if you’re coding with AI this month you’re either on a subscription plan to Claude Code or OpenAI’s Codex (side note - interesting that it’s not ChatGPT Codex; apparently the MASSIVE name recognition for ChatGPT isn’t working for the marketing geniuses at OpenAI), or you’re paying by the token via API.
If you are a hobbyist or solo practitioner like me, you’re subscribing to the incredible loss-leader deals that Claude and Codex offer at $200 a month. I have both, I never run out of tokens, and by my count I’d be spending thousands of dollars via the API.
Okay, whatever. Here’s the gist: I had my agenty-team-people review what’s available today. Why? I have several OpenClaw installs with many agents. My token usage is ramping up, and I’m quite curious about where the future lies when these incredible subscriptions go away.
The following was written by 3 agenty-team-people.
Orchestration: Ops Agent: Opus 4.6
Research: Analytics-Agent: Model: Sonnet 4.6
(Note: Analytics was supposed to run on Kimi K2.5 (its default model) but I had to override to Sonnet after Kimi spent 22 minutes sending garbage web search queries. Kimi can’t do tool calling reliably — it’s good at reasoning over data that’s already in context, not at gathering it apparently!)
Writing: Marketing Agent: Model: Sonnet 4.6
The top five coding models are within one percentage point of each other. This seems kind of remarkable but I don’t know what this really means - are they good at the same things? Some are better at some things than others? No idea.
Claude Opus 4.5 leads at 80.9% on what seems to be the testing standard, “SWE-bench Verified”. Fourth place is MiniMax M2.5 at 80.2%. MiniMax is a model from a company in China priced at $0.30/M input tokens instead of Claude’s $5. I note that it’s from China not because the model works in Chinese but because there is a fascination with how great models are coming from China, which is confusing because they don’t have access to our cutting-edge hardware from Nvidia. And because maybe they are spying on you.
If you’re agonizing over which model to pick, you’re optimizing the wrong thing.
I’ve been using AI coding tools daily since Copilot launched. I run three companies, ship code across multiple stacks, and have strong opinions about what actually matters. Here’s what I learned researching every major coding model in March 2026.
The Three Metrics That Actually Matter
Efficacy isn’t just benchmarks. SWE-bench Verified has a contamination problem — Claude Opus 4.5 scores 80.9% on the verified set but drops to 45.9% on SWE-bench Pro (the uncontaminated version). That’s a 35-point gap. Models trained on test data.
Look at LiveCodeBench instead. It pulls fresh problems from LeetCode, AtCoder, and CodeForces. Gemini 3 Pro crushes it at 91.7%. DeepSeek V3.2 hits 89.6%. These are clean numbers.
Cost isn’t $/million tokens. That’s an abstraction. Real cost is what you pay per 10,000 coding conversations, and that depends entirely on how many input and output tokens your typical request burns. Here’s a rough comparison at list prices:
| Model | Input $/M | Output $/M | Cost per 10K requests | SWE-bench |
|---|---|---|---|---|
| Claude Opus 4.5 | $5.00 | $25.00 | ~$37 | 80.9% |
| Gemini 3.1 Pro | $2.50 | $15.00 | ~$22 | 80.6% |
| GPT-5.2 | $1.75 | $14.00 | ~$20 | 80.0% |
| Kimi K2.5 | $0.45 | $2.20 | ~$4 | 76.8% |
| DeepSeek V3.2 | $0.27 | $1.00 | ~$2 | 73.1% |
Kimi gives you 95% of Claude’s performance at 11% of the cost. DeepSeek gives you 90% at 5% of the cost.
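The arithmetic behind that table is worth internalizing, because the token footprint of a request drives everything. A quick sketch of the math (the token counts in the examples are my own illustrative assumptions, not measured figures; plug in your workload’s real numbers):

```python
def cost_per_requests(input_price_per_m, output_price_per_m,
                      input_tokens, output_tokens, n_requests=10_000):
    """Total spend in USD for n_requests, given $/M-token prices
    and the token footprint of a single request."""
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return per_request * n_requests

# Short autocomplete-style requests (~100 in / ~150 out tokens)
# at Claude Opus list pricing ($5/M in, $25/M out).
light = cost_per_requests(5.00, 25.00, 100, 150)

# Heavy refactoring requests (~2,000 in / ~3,000 out) at the same prices.
heavy = cost_per_requests(5.00, 25.00, 2_000, 3_000)
```

The spread between those two calls is the whole point: the same model gets an order of magnitude more expensive per 10K requests once you feed it real refactoring context.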
Buzz is what developers actually use. Not what benchmarks say they should use.
According to Pragmatic Engineer’s survey, “Anthropic has become the go-to model developer for coding-related work — Opus 4.5 and Sonnet 4.5 come up more often than all other models, combined.”
Claude Sonnet 4.5 is #6 on SWE-bench Verified but #1 in developer hearts. Why? Trust. Ecosystem. Error handling. Edge cases. The stuff benchmarks don’t measure.
The Leaderboard (Reality Edition)
Tier 1: Flagship — Pay for Peace of Mind
Best in Tier: GPT-5.2 ($1.75/M input, 80.0% SWE-bench)
| Model | Cost/10K | SWE-bench | Why Use It |
|---|---|---|---|
| Claude Opus 4.5 | $37 | 80.9% | Developer trust, best error recovery |
| Claude Opus 4.6 | $37 | 80.8% | Latest, 1M context at standard pricing |
| Gemini 3.1 Pro | $22 | 80.6% | Strong on fresh benchmarks (91.7% LiveCodeBench) |
| GPT-5.2 | $20 | 80.0% | Best value in tier, 400K context |
When to pay up: Mission-critical refactoring. Large enterprise codebases. When you can’t afford to debug bad suggestions.
The honest take: These four are statistically tied. Pick based on ecosystem. If you’re using Cursor or Windsurf, you can switch between all of them. Don’t overthink it.
Tier 2: Sweet Spot — 90% Quality, 30% Cost
Best in Tier: Kimi K2.5 ($0.45/M input, 76.8% SWE-bench)
| Model | Cost/10K | SWE-bench | Why Use It |
|---|---|---|---|
| Kimi K2.5 | $4 | 76.8% | Best value in market, native multimodal |
| MiniMax M2.5 | $2 | 80.2% | Dark horse, #4 overall, shockingly good |
| Claude Sonnet 4.6 | $25 | 79.6% | Developer favorite, most trusted |
| Gemini 3 Flash | $5 | 78.0% | Flash pricing, flagship performance |
| Qwen3.5-397B | $6 | 76.4% | Open source, strong agentic capabilities |
This is where I live. Kimi K2.5 at $4 per 10K requests vs Claude Opus at $37 is a no-brainer for 90% of my work. I use Opus when I’m refactoring something critical. Everything else? Kimi or MiniMax.
MiniMax M2.5 is the surprise here. #4 on SWE-bench Verified, beating every OpenAI model except GPT-5.2. Chinese model, barely any developer buzz, but the numbers don’t lie. At $2 per 10K requests, it’s absurdly good value.
Tier 3: Budget Champions — Pennies, Not Dollars
Best in Tier: DeepSeek V3.2 ($0.27/M input, 73.1% SWE-bench)
| Model | Cost/10K | SWE-bench | Why Use It |
|---|---|---|---|
| DeepSeek V3.2 | $2 | 73.1% | ~95% cheaper than Claude, 89.6% LiveCodeBench |
| MiMo-V2-Flash | $1 | 73.4% | Cheapest option that doesn’t suck |
| Gemini 2.5 Flash | $4 | 60.4% | Free tier available, Google ecosystem |
DeepSeek is the open source champion. 73.1% on SWE-bench at $0.27/M input. That’s roughly 95% cheaper than Claude Opus. Twenty-seven cents versus five dollars per million input tokens.
Real-world performance is mixed. Some developers love it, others report it’s hit-or-miss on complex tasks. But at this price, you can afford to regenerate a few times.
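That “regenerate a few times” strategy can be made systematic: hammer the cheap model with retries, and only escalate to a flagship when nothing validates. A minimal sketch (the model callables and validator here are stand-ins for illustration, not any real API):

```python
def cheap_first(task, cheap_model, expensive_model, validate, max_cheap_tries=3):
    """Try the budget model a few times; escalate to the flagship
    only if none of the cheap attempts pass validation."""
    for _ in range(max_cheap_tries):
        result = cheap_model(task)
        if validate(result):
            return result, "cheap"
    return expensive_model(task), "expensive"

# Stand-in models: the cheap one fails twice, then gets it right.
attempts = iter(["SyntaxError", "SyntaxError",
                 "def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"])
cheap = lambda task: next(attempts)
expensive = lambda task: "flagship answer"
looks_valid = lambda out: "Error" not in out

result, tier = cheap_first("write fib", cheap, expensive, looks_valid)
```

Even at three cheap attempts per task, you come out far ahead of routing everything to the flagship by default.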
The Developer Favorites (Buzz Rankings)
Reddit, Hacker News, and Twitter paint a different picture than benchmarks.
#1: Claude Sonnet 4.5 — “Generally considered the best coding model all around” (r/ChatGPTCoding)
Sonnet 4.5 is #6 on SWE-bench Verified. But every developer thread defaults to “just use Sonnet.” Why? It nails the edge cases. Error messages are clear. It doesn’t hallucinate package names. When it refactors, it maintains style consistency.
Benchmarks measure average case. Developers care about worst case. Claude wins on trust.
#2: Kimi K2.5 — “Costs almost 10% of what Opus costs at similar performance” (r/LocalLLaMA)
The value pick. 76.8% on SWE-bench, $0.45/M input. Native multimodal (handles images and text). One-trillion-parameter model that nobody’s heard of, punching way above its weight.
I’ve been using Kimi K2.5 for three weeks. It’s shockingly good. Not quite Claude on complex refactoring, but 90% of the way there at 11% of the cost.
#3: MiniMax M2.5 — “Is this the best coding model in the world?” (r/LocalLLaMA)
Another Chinese model. #4 on SWE-bench Verified (80.2%), beating GPT-5.2. At $0.30/M input, it’s in the same price tier as DeepSeek but performs 7 percentage points higher.
No buzz outside Reddit’s LocalLLaMA community, but the numbers are real. If you’re cost-sensitive and willing to try something off the beaten path, MiniMax delivers.
The Benchmark Reality Check
SWE-bench Verified is contaminated. The gap between Verified and Pro scores proves it:
| Model | Verified Score | Pro Score | Gap |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | 45.9% | -35 points |
| GPT-5 | 74.9% | 23.3% | -51 points |
| Claude Opus 4.1 | 74.5% | 23.1% | -51 points |
These models were trained on problems that overlap with SWE-bench Verified. When you test them on fresh, uncontaminated problems (Pro), performance craters.
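The contamination gap is trivial to compute for any model with both scores. A quick sketch using the figures from the table above:

```python
scores = {  # (SWE-bench Verified, SWE-bench Pro), in percent, from the table above
    "Claude Opus 4.5": (80.9, 45.9),
    "GPT-5": (74.9, 23.3),
    "Claude Opus 4.1": (74.5, 23.1),
}

gaps = {model: round(verified - pro, 1)
        for model, (verified, pro) in scores.items()}

# The bigger the drop, the more the Verified score smells like training-set leakage.
most_suspect = max(gaps, key=gaps.get)
```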
So what’s actually reliable?
LiveCodeBench pulls fresh problems from competitive programming sites. It’s harder to game:
| Model | LiveCodeBench | Developer |
|---|---|---|
| Gemini 3 Pro | 91.7% | Google |
| Gemini 3 Flash (Reasoning) | 90.8% | Google |
| DeepSeek V3.2 | 89.6% | DeepSeek |
| Claude Opus 4.x | 85.0% | Anthropic |
| Qwen 3 235B | 74.1% | Alibaba |
Gemini is crushing it on clean data. But where’s the developer buzz? Nowhere. Reddit barely mentions it. Cursor and Windsurf default to Claude and GPT-5.
Why the disconnect?
Gemini’s API has rough edges. Long-context pricing jumps sharply above 200K tokens. Error messages are cryptic. The ecosystem hasn’t caught up.
Claude and OpenAI win on polish, not raw capability.
HumanEval is saturated. Claude Opus scores 99%. Everyone else is 88-90%. It’s no longer a differentiator. Stop citing HumanEval in 2026.
The Open Source Surprise
2024: “Open source will never catch up to proprietary models.”
2026: MiniMax M2.5 is #4 on SWE-bench Verified. DeepSeek V3.2 is roughly 95% cheaper than Claude at 90% of the performance.
Self-hosting is now viable for serious work.
Hacker News, March 2026: “Kimi K2 on a pair of Mac Studios — 24 tokens/second, negligible power cost vs developer salary.”
Llama 4 Scout has a 10 million token context window. That’s an entire codebase in one shot. Open source. Free to run locally.
Qwen3-Coder 480B hits 69.6% on SWE-bench Verified. MIT/Apache 2.0 license. Deploy it anywhere.
GLM-4.7 (358B parameters, MIT license) scores 73.8% on SWE-bench. You can run this on a cluster of consumer GPUs.
The cost hedge is real. Current API pricing is subsidized VC money. Hacker News sentiment: “Their pricing models are simply not sustainable. Like cab hailing, shopping, social media ads… prices will start going up with nowhere to run.”
Build multi-model workflows now. Use Claude for critical refactoring, Kimi for everything else, DeepSeek for batch jobs, Llama 4 for local experimentation. Don’t lock yourself into one vendor.
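A multi-model workflow doesn’t need fancy infrastructure; at its core it’s a routing table. A minimal sketch (the task categories and model identifiers are my own illustration, not real API model names):

```python
ROUTES = {
    "critical_refactor": "claude-opus",    # pay for peace of mind
    "everyday_coding":   "kimi-k2.5",      # ~90% of the quality at ~10% of the cost
    "batch_job":         "deepseek-v3.2",  # pennies per request
    "experiment":        "llama-4-local",  # free, runs on your own hardware
}

def pick_model(task_kind, default="kimi-k2.5"):
    """Route each job to the cheapest model that's good enough for it."""
    return ROUTES.get(task_kind, default)
```

`pick_model("batch_job")` returns `"deepseek-v3.2"`; anything unrecognized falls back to the daily driver. The point is that the routing decision lives in one place, so swapping vendors is a one-line change instead of a migration.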
The Tools That Matter
Models are commodities. Tools are differentiators.
IDEs: Cursor vs Windsurf vs Copilot
| Tool | Default Models | Pricing | Why Use It |
|---|---|---|---|
| Cursor | GPT-5.x, Claude Sonnet 4.5 | $20/mo | #1 choice, best UX, BYOM support |
| Windsurf | Claude Sonnet 4.5, SWE-1.5 | $15/mo | 80% of Cursor at 75% price, Flow-state awareness |
| Copilot | GPT-4o, GPT-5.x | $10/mo | Tightest GitHub integration, lowest team cost |
I use Cursor. Everyone I know uses Cursor. It’s the default now.
Windsurf is interesting. Flow-state awareness (it tracks what you’re focused on) and Arena Mode (compare models side-by-side in real-time) are killer features. At $15/mo, it’s the value pick.
Copilot is for teams locked into GitHub/Microsoft. $39/seat/month for business. Lowest cost per developer at scale.
All three support BYOM (bring your own model). Cursor and Windsurf let you plug in any OpenAI-compatible API. Use Kimi, MiniMax, or DeepSeek if you want.
CLI Agents: Claude Code vs Codex vs Aider
| Tool | Default Model | Pricing | Why Use It |
|---|---|---|---|
| Claude Code | Claude Sonnet 4.5 | API costs only | Most trusted for complex refactoring |
| OpenAI Codex | GPT-5.2 Codex | API costs only | Best for automated PR workflows |
| Aider | Claude, GPT-5.x | Free + API | Git-native, beloved by power users |
Claude Code is the official Anthropic CLI. I use it for refactoring sessions. It’s the most reliable for multi-step edits across dozens of files.
Codex (OpenAI’s CLI agent) excels at automated workflows. Spawn it in CI, give it a task, it opens a PR. GPT-5.2 Codex scored 64% on Terminal-Bench 2.0 (agentic terminal tasks). GPT-5.3 Codex hit 77.3%.
Aider is for git power users. It speaks in commits and diffs. Free, open source, supports every model via API.
The Agent Shift
We’re moving from autocomplete → autonomous agents.
Cursor Agent Mode, Windsurf Cascade, Codex auto-PR — these tools take a goal, break it into steps, execute, debug, and iterate until done.
SWE-bench measures this capability. So does Terminal-Bench 2.0. Models are getting better at multi-step reasoning, but they’re still fragile. You need to check their work.
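The goal-to-steps-to-execute-to-check loop behind these agent modes is conceptually simple. A stripped-down sketch with stubbed planner/executor/checker functions (everything here is illustrative, not any tool’s real API):

```python
def run_agent(goal, plan, execute, check, max_iterations=5):
    """The basic agent loop: plan the goal into steps, run them,
    verify the result, and fold any failure back into the next pass."""
    for _ in range(max_iterations):
        steps = plan(goal)
        results = [execute(step) for step in steps]
        ok, feedback = check(goal, results)
        if ok:
            return results
        goal = f"{goal} (previous attempt failed: {feedback})"
    raise RuntimeError("agent gave up after max_iterations")

# Stubs standing in for model calls.
plan = lambda goal: ["edit the code", "run the tests"]
execute = lambda step: f"done: {step}"
check = lambda goal, results: (all(r.startswith("done") for r in results), "")

results = run_agent("add caching to the fetch layer", plan, execute, check)
```

Every real agent product adds sandboxing, tool calls, and context management on top, but the fragility lives in this loop: one bad `check` and the agent happily declares victory on broken code.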
Dario Amodei (Anthropic CEO, March 2026): “Within six months 90% of all code will be written by AI.”
Maybe. But 100% of code will still be reviewed by humans. For now.
The Ship Velocity War
OpenAI shipped three major GPT-5 releases in three months:
- GPT-5.2: December 2025
- GPT-5.3: March 2026
- GPT-5.4: March 2026 (two days later)
Developer reaction: Fatigue.
Reddit, March 2026: “Three months between 5.2 and 5.3, then two days to 5.4? What are we even testing here?”
Contrast this with Anthropic. Two releases in the same period:
- Claude Opus 4.5: November 2025
- Claude Sonnet/Opus 4.6: February 2026
Both releases dominated SWE-bench. Both earned developer trust. Slow and steady wins.
Rapid releases feel like thrashing, not progress. Developers want stability. We build workflows around models. Switching costs are high. Ship less, ship better.
The New Hotness (March 2026)
The leaderboard is moving fast. Here’s what just landed or is generating serious heat this month:
GPT-5.4 — OpenAI’s “We’re Back” Moment
Dropped March 5th — two days after GPT-5.3. Native computer use, 1M context in Codex, 57.7% on SWE-bench Pro (the uncontaminated benchmark). That Pro score is the real story — it’s the highest any OpenAI model has hit on clean data.
The vibe from Every.to: “I’m reaching for GPT-5.4 more than Codex 5.3 — not because it’s dramatically more intelligent on raw coding quality, but because it’s much better to work with moment to moment.”
GPT-5.4 Mini and Nano also just shipped — near-flagship performance at a fraction of the cost. This is where the price war gets real.
MiMo-V2-Pro (a.k.a. “Hunter Alpha”) — Xiaomi’s Stealth Bomb
This one’s wild. Xiaomi dropped a trillion-parameter model that was quietly running on OpenRouter under the codename “Hunter Alpha” — racking up 1 trillion tokens before anyone knew who made it.
The numbers: 75.7 on Claw-Eval (3rd globally, behind only Claude Opus 4.6), beats Claude Sonnet 4.6 at coding, approaching Opus-level on agent tasks — at 67% lower cost. VentureBeat called it “stunning.”
A phone company is making one of the best coding models in the world. We’re in that timeline now.
Grok 4.20 — xAI’s Multi-Agent Architecture
Elon’s team went a completely different direction: four AI agents running in parallel. 75% on SWE-bench, 79.6% on Aider Polyglot. The architecture is genuinely novel — not just a bigger transformer.
Early developer reaction is positive but cautious. The multi-agent approach means it handles complex, multi-file tasks differently than single-model approaches.
Seed 2.0 Pro — ByteDance Enters the Arena
ByteDance’s play. 76.5% SWE-bench Verified, AIME 98.3, Codeforces rating of 3020. Released February 14th, strategically timed right before China’s Spring Festival.
Three variants: Pro (flagship intelligence), Lite (balanced), Mini (speed/cost). The Pro tier is legitimately competing with Claude and GPT-5.x on coding. ByteDance has the infrastructure to subsidize pricing aggressively.
Nemotron 3 Super — Nvidia’s Efficiency Play
Nvidia isn’t competing on raw benchmark scores — Nemotron 3 Super sits at #26 on LiveCodeBench. But it’s doing something different: 2.2x higher throughput than comparable open models, 1M context window via hybrid Mamba-Transformer architecture, and it runs on Nvidia’s own hardware stack.
The real play: if you’re running your own inference cluster on Nvidia GPUs, this is built for you. Greptile’s review: “Punches far above its weight class.”
Step-3.5-Flash — The Price Destroyer
StepFun’s model. 74.4% SWE-bench Verified at $0.10/M input, $0.40/M output. That’s $0.63 per 10K coding requests. Less than a dollar for ten thousand conversations.
196B parameters. Nobody talks about it. The numbers are insane for the price.
GLM-5 — Zhipu AI’s Quiet Climb
78.0% SWE-bench Verified. 744B parameters. $1.00/M input, $3.20/M output. Zhipu AI has been steadily climbing the leaderboard without any of the hype. GLM-5 is now #8 overall, ahead of Kimi K2.5.
If the Chinese model ecosystem keeps shipping at this pace, the pricing floor for top-tier coding models is going to zero.
What I Actually Run
Full transparency — here’s my real stack. I run OpenClaw, an AI agent orchestration platform, with 8 functional agents across multiple businesses. This isn’t theoretical.
My Daily Models
| Model | What I Use It For | Monthly Spend |
|---|---|---|
| Claude Opus 4.6 | Primary interface (COO agent), complex orchestration | Heavy — it’s my main brain |
| Claude Sonnet 4.5 | Marketing, Design, Support, Ops agents | Moderate — workhorse for tool-heavy tasks |
| GPT-5.2 Codex | Engineering agent — pure code generation | On-demand — spawned for coding tasks |
| Kimi K2.5 | Finance, Analytics agents — data analysis | Light — great for reasoning, terrible at tool calling |
| Qwen 3.5 122B | Local inference on Mac Studio | Free (electricity only) |
| Qwen 2.5 14B/7B | Edge inference on home server | Free |
Lessons from Actually Running Multi-Model
Kimi K2.5 can’t do tool calling. It spent 22 minutes sending garbage web search queries when I asked it to research this very blog post. Great at reasoning over data already in context. Terrible at gathering data via tools. I override to Sonnet for any agentic work.
Codex is a beast in a harness, mediocre in conversation. GPT-5.2 Codex shines when you give it a codebase, a task, and let it run. Don’t try to have a dialogue with it — that’s not what it’s built for.
Local models are the backup plan. I run Qwen 3.5 122B (MoE, only 10B active parameters) on a Mac Studio. 24 tokens/second, zero marginal cost. When APIs go down or I’m on a plane, local keeps working.
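The backup-plan logic is just a fallback chain: try each backend in priority order and move on when one fails. A sketch (the backend callables are stand-ins; real code would wrap actual API clients and catch specific timeout/HTTP errors):

```python
def with_fallback(task, backends):
    """Try each backend in priority order (hosted API first, local last);
    return the first one that answers."""
    errors = []
    for name, call in backends:
        try:
            return name, call(task)
        except Exception as exc:  # real code: catch specific network errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))

# Stand-ins: the hosted API is down, the local model answers.
def hosted_api(task):
    raise TimeoutError("connection timed out")

def local_qwen(task):
    return "local answer"

backend_used, answer = with_fallback(
    "explain this diff",
    [("claude-api", hosted_api), ("qwen-local", local_qwen)],
)
```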
The subscription plans are insane value. Claude Max and Codex Pro at $200/month each — I’d be spending thousands via API for the same usage. If you’re a solo practitioner, subscribe. The loss-leader pricing won’t last forever.
Multi-model is the move. Don’t pick one model. Use Opus for critical work, Sonnet for tool-heavy tasks, Kimi for cheap analysis, Codex for code generation, local Qwen for experiments. Build workflows that route to the right model for the job. That’s what my agent team does.
The Verdict
If You’re Paying for APIs
Daily driver: Kimi K2.5 ($4 per 10K requests, 76.8% SWE-bench)
Critical tasks: Claude Opus 4.5 ($37 per 10K, 80.9% SWE-bench)
Budget option: DeepSeek V3.2 ($2 per 10K, 73.1% SWE-bench)
Build multi-model workflows. Use Cursor or Windsurf with BYOM support. Don’t lock yourself into one vendor.
If You’re Self-Hosting
Best context window: Llama 4 Scout (10M tokens, open source)
Best performance: Qwen3-Coder 480B (69.6% SWE-bench, MIT license)
Best value: GLM-4.7 (73.8% SWE-bench, fits on consumer hardware)
Self-hosting protects you from future price increases and API outages. Worth the infrastructure cost if you’re running high-volume workloads.
If You’re Choosing an IDE
Best overall: Cursor ($20/mo, BYOM support)
Best value: Windsurf ($15/mo, Arena Mode)
Best for teams: Copilot ($10/mo, $39 business)
All three are good. Cursor has the best UX. Windsurf has the best features-to-price ratio. Copilot has the lowest per-seat cost at scale.
The Real Lesson
Model choice matters less than workflow, tooling, and prompt engineering.
The top five models are within 1% of each other. Kimi K2.5 costs 11% of Claude Opus and delivers 95% of the performance. Open source models are good enough for serious work.
Stop agonizing. Pick a tier, pick a model, build systems around it, ship code.
The AI coding wars are over. Everyone won. Now go build something.
Benchmarks: SWE-bench Verified, LiveCodeBench, SWE-bench Pro. Developer sentiment: Pragmatic Engineer, r/LocalLLaMA, Hacker News. Tool comparisons: TLDL, BuildMVPFast.
Thanks for reading. If this sparked an idea, send it to someone building cool things.