Multi-Model Routing Is Not an Architecture Experiment Anymore

A year ago, multi-model routing was a demo. Something you'd see in a hackathon or a conference presentation. The idea was simple enough: not every task needs your best, most expensive model. Route the simple stuff to something cheaper. Reserve the reasoning heavy-hitters for when you actually need them.

The theory was sound. The practice was immature. The tooling was rough. Most production systems stuck with one model because the operational complexity of managing multiple wasn't worth the economics.

That calculus has shifted. The tooling caught up. The model ecosystem fragmented in ways that make single-model approaches feel limiting. And the cost differential between model tiers has gotten wide enough that ignoring those savings is now a deliberate choice, not a default.

What Multi-Model Routing Actually Means in Production

Not what it means in demos. In production.

A routing layer sits between your application and your model calls. Before sending a request, it classifies the request — by task type, by required capability, by expected output length, by whether the task requires reasoning or retrieval or generation or all three — and routes to the appropriate model.

The classification itself can be a model call (small, cheap, and fast: something that costs fractions of a cent and adds under 50ms of latency). Or it can be rule-based if your task taxonomy is well-defined enough. Usually it's both: rules for the obvious cases, a classifier for the ambiguous ones.
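To make the "rules first, classifier for the rest" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the task-type labels, the tier names, and especially stub_classifier, which stands in for the small model call and is not a real classifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Request:
    task_type: str  # hypothetical labels such as "reformat" or "extract"
    prompt: str


# Hypothetical rule table: task types whose tier is obvious up front.
RULES = {
    "reformat": "tier1",
    "extract": "tier1",
    "multi_step_analysis": "tier3",
}


def rule_route(req: Request) -> Optional[str]:
    """Return a tier for the obvious cases, or None if no rule applies."""
    return RULES.get(req.task_type)


def route(req: Request, classify: Callable[[str], str]) -> str:
    """Apply rules first; call the classifier only for ambiguous requests."""
    tier = rule_route(req)
    if tier is not None:
        return tier
    return classify(req.prompt)


def stub_classifier(prompt: str) -> str:
    """Stand-in for a small model call: long prompts go to tier2, else tier1."""
    return "tier2" if len(prompt.split()) > 50 else "tier1"
```

Passing the classifier in as a callable keeps the rule path free of any model dependency, so the obvious cases never pay the classifier's latency.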

The downstream model selection is where it gets interesting.

The Three-Tier Pattern

Most production multi-model systems I've seen end up with something like three tiers.

Tier 1 — Fast/Cheap. Small models, sub-500ms end-to-end, optimized for straightforward generation, classification, and extraction tasks. The stuff that makes up 60-70% of your volume by request count. Think reformatting, simple Q&A, basic summarization. This is where the cost savings live.

Tier 2 — Capable/Balanced. Mid-range frontier models — the ones that sit below the flagship pricing but above the small models. Good reasoning, solid context handling, reasonable cost. This covers the bulk of what's actually hard — tasks that need some genuine intelligence but don't require the top of the capability curve.

Tier 3 — Best Available. The flagship reasoning models. Reserved for tasks that genuinely require them: complex multi-step reasoning, ambiguous high-stakes decisions, synthesis across large amounts of context. If you're doing this right, Tier 3 handles maybe 10-15% of your requests by count but probably accounts for 30-40% of your cost.
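The gap between share of requests and share of cost is just a weighted average, and it is worth working through once. The sketch below uses volume shares picked from inside the ranges above and relative per-request costs that are purely assumed (1x, 5x, 8x); they are not real prices, so plug in your own.

```python
# Illustrative numbers only. Volume shares sit inside the ranges quoted above;
# the relative per-request costs are assumptions, not real model prices.
TIERS = {
    "tier1": {"volume": 0.65, "unit_cost": 1.0},
    "tier2": {"volume": 0.22, "unit_cost": 5.0},
    "tier3": {"volume": 0.13, "unit_cost": 8.0},
}


def cost_shares(tiers):
    """Fraction of total spend attributable to each tier."""
    spend = {name: t["volume"] * t["unit_cost"] for name, t in tiers.items()}
    total = sum(spend.values())
    return {name: s / total for name, s in spend.items()}


def savings_vs_single_model(tiers, baseline="tier3"):
    """Fraction saved compared with sending every request to the baseline tier."""
    blended = sum(t["volume"] * t["unit_cost"] for t in tiers.values())
    return 1.0 - blended / tiers[baseline]["unit_cost"]
```

With these assumed inputs, tier3 is 13% of requests but a bit over a third of spend, and the blended cost per request is roughly a third of what sending everything to tier3 would cost. Different cost ratios will move those numbers a lot, which is the point of computing them for your own traffic.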

The specific models in each tier are less important than the taxonomy. The framework is durable even as the models change.

What the Tooling Looks Like Now

Six months ago, building this yourself was the only option. Now there are frameworks that handle the plumbing. The major cloud AI platforms have added routing primitives: Azure AI Foundry's multi-provider support, for one, along with the inference routing features that several of the AI-native platforms shipped in the last quarter.

I'm still wary of fully-managed routing where you give up control of the routing logic. The cases where you need to override the router — where the classification is wrong, where a task type is ambiguous, where you're debugging a production failure — those cases are where control matters. The fully-managed approach trades control for convenience, and the convenience isn't yet worth the tradeoff at production scale for most use cases.
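For a sense of what "control" means here, this is a rough sketch of an override hook wrapped around whatever router you already have. The class name, the pin/unpin methods, and the decision log format are all hypothetical; the only idea it illustrates is that a human can force a tier for a task type and that every decision records where it came from.

```python
from typing import Callable, Dict, List, Tuple


class OverridableRouter:
    """Wraps a base routing function with manual per-task-type overrides."""

    def __init__(self, base_route: Callable[[str, str], str]):
        self._base_route = base_route          # (task_type, prompt) -> tier
        self._overrides: Dict[str, str] = {}   # task_type -> forced tier
        self.decisions: List[Tuple[str, str, str]] = []  # (task_type, tier, source)

    def pin(self, task_type: str, tier: str) -> None:
        """Force every request of this task type to a specific tier."""
        self._overrides[task_type] = tier

    def unpin(self, task_type: str) -> None:
        """Remove a manual override, if one exists."""
        self._overrides.pop(task_type, None)

    def route(self, task_type: str, prompt: str) -> str:
        """Use the override when present, otherwise defer to the base router."""
        if task_type in self._overrides:
            tier, source = self._overrides[task_type], "override"
        else:
            tier, source = self._base_route(task_type, prompt), "router"
        self.decisions.append((task_type, tier, source))
        return tier
```

During an incident you would pin the misbehaving task type to a higher tier, then read the decision log afterwards to see how many requests the override actually caught.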

Building your own routing layer is not that much code. The complexity is in the task taxonomy and the evaluation — getting good signal on whether your routing decisions are actually improving outcomes. That's the hard work, and no framework does it for you.

The Evals Problem

Here's what nobody tells you about multi-model routing until you're debugging it at 2am: your evals need to cover the routing decisions, not just the model outputs.

A system where the right answer is generated by the wrong model — too expensive, too slow, wrong capability profile — is still a failure. A system where the router sends a task to Tier 1 that needed Tier 3 and gets a degraded answer is a failure. The evaluation framework has to measure routing accuracy, not just output quality.
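One way to put a number on that, sketched under the assumption that you have an eval set where each request is labelled with the tier it actually needed. The tier names and the record format are hypothetical. The split that matters is between under-routing (the degraded-answer failure) and over-routing (the too-expensive, too-slow failure).

```python
from collections import Counter

# Hypothetical tier ordering: a higher rank means a more capable, pricier tier.
TIER_RANK = {"tier1": 1, "tier2": 2, "tier3": 3}


def routing_report(records):
    """Summarize routing accuracy.

    records: iterable of (routed_tier, needed_tier) pairs from a labelled eval set.
    Returns the fraction of requests routed correctly, under-routed, and over-routed.
    """
    counts = Counter()
    for routed, needed in records:
        if TIER_RANK[routed] == TIER_RANK[needed]:
            counts["correct"] += 1
        elif TIER_RANK[routed] < TIER_RANK[needed]:
            counts["under_routed"] += 1  # quality risk: model too weak for the task
        else:
            counts["over_routed"] += 1   # cost/latency waste: model stronger than needed
    total = sum(counts.values())
    if total == 0:
        raise ValueError("routing_report needs at least one record")
    return {k: counts[k] / total for k in ("correct", "under_routed", "over_routed")}
```

Tracking the two failure directions separately matters because they call for opposite fixes: under-routing means the router is too aggressive about saving money, over-routing means it is too cautious.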

Most teams get this backwards. They optimize for output quality in isolation and discover the routing failures only when something breaks in production.

Get the evals right first. The routing architecture is the easy part.

Measure the routing, not just the output.

— Dustin