How Not to Burn Your AI Budget in 2026

Multi-model routing is how teams cut their Claude bills 50-75% without losing quality.

Multi-model cost comparison across Claude, Gemini, and open-weight tiers

This is the part of the 2026 AI-budget story that escapes the headlines. Uber burning through their full year's budget in four months is the news-cycle version. Most teams are running the same dynamic on a smaller scale: quieter, slower, but with the bill climbing while shipping throughput holds flat. Same cause both times. Cheap jobs running on expensive models. And with Anthropic's June pricing changes on the horizon, this dynamic gets exponentially worse for teams still routing everything to frontier models.

What an agentic workflow actually is

When a team adopts agentic coding, they tend to talk about "the AI" as if it were one thing. A real coding agent that ships a PR has done twelve to twenty distinct jobs along the way, and the tier they belong to varies wildly.

A few of those jobs genuinely need frontier reasoning: plan generation on a 50,000-line monorepo, architectural reviews that cross three services, root-cause analysis on a flaky test that fails for the third time. Being wrong on any of those costs the team a week downstream, so Opus tokens earn their keep there.

Most of the work in the same pipeline does not qualify:

Triage. Read the inbound ticket, classify it, route to the right workflow. 200 tokens in, 50 out, fixed JSON schema. Gemini 2.5 Flash-Lite at $0.10/$0.40 per million tokens does it cleanly. Opus 4.7 at $5/$25 does the same job for fifty times the cost.
Test generation. Schema-bound, high-volume, easy to verify. Haiku 4.5 ($1/$5) or Devstral Small ($0.10/$0.30) ships better tests per dollar than any frontier model.
PR description writing. Pure summarisation. Gemini Flash-Lite with a cached template prefix lands at fractions of a cent per PR.
Prompt-injection detection on every inbound prompt. Not an LLM job. An 86-million-parameter classifier (Llama Prompt Guard 2) on a CPU, under fifty milliseconds latency, free if self-hosted. Doing this with Sonnet is engineering malpractice.
Embedding retrieval. Two cents per million tokens with text-embedding-3-small. No LLM call in this stage at all.

Route all of those to Opus and you have built the Uber budget-burn machine, in miniature, at your own company. The model is being asked to do work that should have gone to a model fifty times cheaper, in a loop that fires hundreds of times a day.

The numbers

We worked through the math with a customer running Claude Code across a 25-engineer team last month. Their setup was frontier-only. Every call routed to Sonnet or Opus, including triage, PR descriptions, doc updates, and log parsing.

Their monthly Claude bill was $14,000. After we re-routed:

Frontier reasoning (plan generation, architectural review) stayed on Opus 4.7, with prompt-cache prefixes wrapped around the codebase context. Cache hit rate moved from 12% to 71%.
Production code generation: Sonnet 4.6, unchanged.
Triage, classification, log parsing, PR descriptions, doc summaries: routed to Gemini Flash-Lite and Haiku 4.5, with the batch API for jobs that tolerated 24h latency.
Security pre-filter: a self-hosted BERT classifier for secret detection. Free at inference.
Embedding and rerank: text-embedding-3-small and Cohere Rerank 3.5.

The new bill was $4,200. Shipping throughput held flat. PR acceptance rate actually crept up slightly, because the verifier loop got a different vendor's eyes on each diff and caught more of the agent's blind spots.

That outcome is what happens whenever an orchestrator gets routing rights and the team stops treating "Claude" as a synonym for "the agent."

Why this is an orchestrator job

The reason most teams cannot do this on their own is structural. A developer with one Claude Code session is talking to one model at a time. The choice to route a triage call to Gemini Flash-Lite and a plan-generation call to Opus has to be made somewhere above the agent. There is exactly one place in the stack that lives: the orchestrator.

That place is what Ivy Tendril fills. Tendril sits above the coding agent. The orchestrator picks the model per stage: a classifier for triage, a frontier model for planning, a workhorse for code generation. It enforces a token cap before each call, so a runaway retry loop hits a wall before the company credit card hears about it. It tracks which model wrote which line in which region, which is the foundation of any EU compliance story. And it owns the prompt structure, which lets it refactor for prompt-cache hits (90% off on Anthropic, 75% on Google, 40-75% on OpenAI) without the customer touching the underlying agent at all.

The whitepaper line for this is "code- and model-agnostic." In practice, our customers who lived through the May 14 plan split did not need to change their workflow. The orchestrator re-routed for them.

What this means if you're running ad-hoc today

Most teams I talk to are at level 3 on the agentic-coding curve. One developer talking to one agent, with one model behind it. That setup felt like progress two years ago. Today it produces $14,000 monthly Claude bills with no breakdown of where the money went.

The practical first step is to install the layer above the agents that can change models. Once that layer is in place, routing is a configuration file. Without it, every cost conversation ends with "we'll watch our Claude usage more carefully," which is the agentic-engineering equivalent of hoping.

Tendril is open source and free. It deploys onto your Azure, AWS, GCP, or on-prem subscription in minutes. The orchestrator and the framework underneath are both free forever. The packages we offer are for teams that want help wiring it up to their stack faster than they could alone.

If your Claude bill is climbing, the orchestration layer is the piece you're missing. Email renco@ivy.app and we'll show you the routing config.

Talk to us: renco@ivy.app · tendril.ivy.app · github.com/Ivy-Interactive/Ivy-Tendril