The Token Price Floor Is Gone. Now What?
The conversation about AI costs used to be simple. Expensive model, expensive inference, expensive mistake. Run a search query through a frontier model and wince at the bill. Batch it carefully. Cache everything you can.
That's not the world anymore.
Over the past 18 months, input token prices across the major API providers have dropped somewhere between 80 and 97 percent depending on the model tier you're comparing. What cost $10 per million tokens in early 2024 costs under $0.50 in many cases today. The price floor didn't move — it collapsed.
And the architectural implications are still catching up.
What Actually Happened
The short version: once you've trained a model, inference is a fixed compute problem whose cost falls as hardware improves and competition increases. The GPU-compute equivalent of Moore's Law kept pushing through. And competition, both from frontier labs running at massive scale and from open-weight models now running on commodity hardware, destroyed the ability to extract economic rent from access.
The providers that used to charge for having access to capable models are now charging for volume. That's a fundamentally different market structure.
It compounds. Cheaper tokens mean more tokens in more prompts. More tokens mean longer context windows start to make business sense. Longer context means certain things you'd have built as retrieval pipelines can just... go in the context. RAG made a lot of sense at $20 per million tokens. At $0.30, the math changes.
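A rough sketch of that math, in Python. The two prices are the ones quoted above; the prompt sizes (2,000 tokens for a retrieval-trimmed prompt, 50,000 for pasting everything into the context) are assumptions picked for illustration, not measurements, and the function names are hypothetical.

```python
def input_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Input-token cost of a single request, in dollars."""
    return tokens / 1_000_000 * price_per_million_usd


# Hypothetical prompt sizes (assumptions, not measurements).
RETRIEVED_PROMPT_TOKENS = 2_000   # RAG: only the top retrieved chunks
FULL_CONTEXT_TOKENS = 50_000      # everything pasted into the context


def extra_cost_per_request(price_per_million_usd: float) -> float:
    """What skipping retrieval costs per request at a given input price."""
    full = input_cost_usd(FULL_CONTEXT_TOKENS, price_per_million_usd)
    trimmed = input_cost_usd(RETRIEVED_PROMPT_TOKENS, price_per_million_usd)
    return full - trimmed


for price in (20.00, 0.30):
    extra = extra_cost_per_request(price)
    print(f"${price:.2f}/M tokens: full context costs ${extra:.4f} more per request")
```

Under these assumptions the premium for skipping retrieval is about $0.96 per request at $20 per million tokens and about $0.014 at $0.30. Whether that gap justifies keeping a retrieval pipeline depends on your request volume and on what the pipeline costs to maintain.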
The Architectural Shift Nobody's Talking About
When inference was expensive, architectures optimized for prompt brevity. Short system prompts. Tight context windows. Routing to cheaper smaller models whenever possible. Hard cutoffs on generation length.
These were cost-driven decisions, not performance-driven ones.
When the cost constraint loosens, you start making different calls. More context. Longer chains. Model switching as a quality lever rather than a cost lever. Patterns that were prohibitively expensive for production — dense reasoning passes, multi-turn orchestration, verification loops — start making sense in a different cost environment.
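As one illustration, a verification loop can be sketched in a few lines of Python. This is a minimal sketch, not a reference design: `generate` and `verify` are hypothetical callables standing in for whatever model calls you use, and the retry policy (regenerate from the same prompt, up to `max_attempts` times) is an assumption.

```python
from typing import Callable, Tuple


def generate_with_verification(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: prompt -> draft
    verify: Callable[[str, str], bool],  # hypothetical: (prompt, draft) -> ok?
    max_attempts: int = 3,
) -> Tuple[str, int, bool]:
    """Return (last_draft, attempts_used, verified).

    Each attempt spends one generate call and one verify call, so the
    worst case costs roughly 2 * max_attempts model calls.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    draft = ""
    for attempt in range(1, max_attempts + 1):
        draft = generate(prompt)
        if verify(prompt, draft):
            return draft, attempt, True
    # Nothing passed verification; hand back the last draft, flagged as unverified.
    return draft, max_attempts, False


# Usage with stub callables in place of real model calls.
drafts = iter(["draft-1", "draft-2", "draft-3"])
result = generate_with_verification(
    "summarize the incident report",
    generate=lambda prompt: next(drafts),
    verify=lambda prompt, draft: draft == "draft-2",
)
print(result)  # ('draft-2', 2, True)
```

The loop multiplies token spend by up to twice the attempt count, which is exactly the kind of pattern that was hard to justify under the old prices. It also multiplies latency, which the price drop does nothing about.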
The problem: most production architectures were built under the old math and haven't been rethought.
I've been through this myself. Systems designed with heavy caching layers and aggressive token budgets because the economics demanded it. Some of those decisions still make sense. Others are now pure overhead — complexity without benefit.
Anything you built for inference cost efficiency before 2025 deserves a fresh look. Not necessarily a rewrite. But the assumptions deserve pressure.
Where It Doesn't Change Things
Latency is still real. A 200ms inference call in a user-facing flow doesn't get faster because the tokens are cheaper.
Quality variance is still real. Cheaper models still exist, still produce worse outputs, and are still the wrong call for reasoning-heavy tasks regardless of what they cost.
And operational complexity is still real. Cheaper tokens don't simplify your observability, your evals, or your failure handling. Those costs shifted from dollars to engineering time a while ago. That ratio hasn't changed.
The Real Question
The token price floor being gone doesn't mean inference is free. It means a different set of architectural constraints is now binding.
Figuring out which constraints those are — for your specific system, your specific workload, your specific quality requirements — is the actual work.
The models are cheap. The thinking still costs what it always did.
Build like the floor keeps dropping. It will.
— Dustin