Stop Overpaying for Intelligence

Fri, 24 Apr 2026 00:00:00 +0000

The default behavior of most AI-powered products today is simple: when in doubt, call the frontier model.

Every customer message classified by GPT-5. Every extraction task routed through Claude. Every prompt, no matter how trivial, handled by the same multi-trillion-parameter machine that was designed to reason about legal strategy, write pharmaceutical patents, and debug distributed systems.

It works. It’s fast to integrate. It makes the product feel intelligent.

And it is quietly becoming one of the most expensive habits inside the modern AI stack.

The Overpaying Default
#

A few weeks ago I wrote about the end of cheap AI — the moment when subscription limits, rate caps, and honest inference costs finally started reflecting the real economics of frontier models. That’s the macro story.

This is the micro story that sits underneath it.

The reason so many companies are about to feel the squeeze is not only because frontier prices are rising. It’s because the typical architecture was built on a silent assumption: that there was no cost worth worrying about, so the best model should handle everything. That assumption is breaking down from two sides at once.

From one side, frontier inference is getting more expensive, metered harder, and subsidized less.

From the other side — and this is the part most roadmaps haven’t priced in yet — local and open-weight models have quietly become good enough for a very large share of real enterprise tasks.

That combination changes the economics of AI more than any single product announcement this year.

What Local Models Can Actually Do Now
#

A few years ago, “run your own LLM” meant a heroic engineering project, a clear downgrade in quality, and an infrastructure team that secretly missed the cloud.

Today, it doesn’t.

The current generation of open-weight models — Llama, Qwen, Mistral, DeepSeek, Gemma, and their derivatives — has crossed capability thresholds that would have sounded like science fiction in 2023. A 70B-parameter open-weight model running on a single high-end workstation or a modest GPU instance now performs competitively on the benchmarks that mattered most to enterprises at the start of this cycle: general reasoning, code completion, summarization, extraction, translation, structured output.

And it keeps getting better. Fast.

This doesn’t mean frontier and open-weight are interchangeable. They are not. Frontier still pulls clearly ahead on long-context coherence, multi-step agentic planning, novel domain synthesis, and the hardest tiers of code generation.

But what matters for an AI roadmap is not whether local models have caught up on everything. It is whether they are good enough on the specific tasks your system actually performs.

And for most enterprise workloads in 2026, the answer is increasingly yes.

The Map Most AI Architectures Are Missing
#

If you look carefully at what happens inside most AI-powered applications, the workloads split cleanly into two groups.

Tasks that genuinely need a frontier model:

Long-context reasoning across dozens of documents.
Multi-step agentic planning over ambiguous goals.
Complex code generation from scratch in unfamiliar domains.
Creative synthesis that blends multiple expert voices.
Handling highly adversarial or edge-case inputs that require real judgment.

Tasks that almost certainly don’t:

Classifying an email, ticket, or document by type.
Extracting entities, dates, and amounts from a text.
Summarizing a page or two of content.
Rewriting a paragraph in a different tone.
Producing structured output (JSON, SQL) from plain text.
Translating between major languages.
Answering FAQs from a retrieval layer.
Deterministic sub-steps inside a larger agent.

Most AI architectures treat these two groups the same way. They shouldn’t.

The job of any mature AI stack — and I use “mature” here in the adult-phase sense, as opposed to the sugar-rush phase of the last two years — is to route each task to the right tier. Frontier when it earns it. Open-weight when it doesn’t.

The word I’ve been using internally for this is task discrimination. Not in the political sense — in the architectural one. The ability to recognize that different tasks deserve different intelligence budgets, and to design accordingly.

It’s Not Just About Cost
#

Cost is the most visible reason to care about task discrimination. It is not the only one.

There are four other reasons that keep compounding the more deeply an organization uses AI.

Latency. A local 8B or 13B model running next to your application can return a classification in under 100 milliseconds. A round-trip to a frontier cloud API is rarely that fast. For interactive experiences, user-facing agents, or high-frequency internal automations, that gap matters.

Privacy and data residency. Routing every customer email, patient chart, student record, or internal memo through a third-party model is a governance posture that is aging badly. Regulators have noticed. Boards have noticed. For an increasing number of use cases — health, education, legal, defense, government, and anything covered by local data protection regimes — local inference is not an optimization. It is a requirement.

Reliability. When your architecture depends on a single frontier provider, you also depend on their rate limits, their subscription restrictions, their outages, and their commercial roadmap. That is a level of systemic dependency that would raise eyebrows in any other part of the tech stack.

Determinism and control. A smaller model you fully control, fine-tuned or prompt-tuned for a narrow task, often behaves more predictably than a generalist frontier model optimized to handle the entire universe. Predictability is underrated until it is missing.

None of these points is, on its own, a reason to abandon frontier models. All of them together are a reason to stop defaulting to frontier models for everything.

The Numbers Are Not Subtle
#

Let me illustrate with a simple scenario.

Imagine a mid-sized organization running a million lightweight AI calls a month: a mix of classification, extraction, summarization, and structured output. Say the average call uses around a thousand tokens in and out.

Routed through a top-tier frontier model, the inference bill for those calls lands comfortably in the tens of thousands of euros per month. Multiply by twelve, add growth, and this is the kind of line item that starts showing up in CFO reviews.

The same workload, routed through a well-hosted open-weight model — either on-prem or on a dedicated GPU instance at a specialized provider — comes out an order of magnitude cheaper, sometimes two. And the quality difference, on precisely these task types, is typically invisible to end users.

That is not a rounding error. That is the difference between AI being a sustainable operational capability and AI being a line item your CFO starts questioning at every forecast.

And the organizations that realize this first will not use the savings to shrink. They will use them to scale further.

What This Looks Like On My Own Desk
#

The most honest way to write about task discrimination is to describe what I actually run, not what I think other people should run.

In my own setup, I have a Mac Studio dedicated to serving local models to my agents. It sits quietly on a shelf, publishes a private inference endpoint through LM Studio, and hosts a small library of open-weight models optimized for MLX — the framework that lets these models take full advantage of Apple Silicon’s GPU and unified memory.

Nothing about that machine is exposed to the public internet. The endpoint lives inside my own network, behind the boundaries any serious setup demands. For the kind of work I route through it, that is not optional.

I chose a Mac Studio over the obvious alternative — a dedicated GPU rig — for reasons that are not purely technical. It is powerful enough for the model sizes that actually matter to me. It is extraordinarily reliable. It is almost perfectly silent. And its idle power draw is low enough that I can leave it on 24/7 without thinking twice. None of that matters when you are renting H100s by the hour. It matters a lot when the machine is a permanent piece of your operating stack.

The architecture itself is deliberately simple.

The main orchestrator — the LLM that gives my agents their judgment and planning capability — is a frontier model. That is where the hard reasoning happens, where ambiguity has to be resolved, where the whole plan needs to hold together. For that role, paying for the best is worth it.

But underneath the orchestrator, routing rules push subagent tasks to my local endpoint whenever it is possible or recommended. Local handles the grunt work. Frontier handles the thinking.

The result is that my frontier bill has collapsed without any perceptible loss of quality in the end-to-end experience. Not because local has caught up on everything — it has not — but because a very large share of what any agent actually does is not reasoning in the hard sense. It is classifying. Extracting. Summarizing. Reformatting. Translating. Producing structured output.

Models like qwen3.6-35b-a3b-ud-mlx, gemma-4-31b-it-mlx, or gpt-oss-20b-mlx handle these tasks beautifully. Running locally. With latencies a cloud round-trip cannot match. And without sending a single byte of context to a third party.

That is not a theoretical architecture. That is what is running on my desk, today.

So What Should Actually Change?
#

There is no need to rip anything out. There is a need to rearchitect.

At least across five fronts.

1. Build a task taxonomy
#

Every AI call in your product or operations belongs to a complexity tier. Map them. Most teams discover that more than half of their calls sit comfortably in the “does not need frontier” bucket — and have been happily paying frontier prices for them for years.

2. Start with a router, not a migration
#

The highest-leverage first step is not swapping out your model. It is adding an intelligent routing layer — sometimes as simple as “classify intent, then dispatch” — that sends trivial tasks to a cheaper tier and escalates only when confidence is low or complexity is high.

3. Measure cost and quality per task, not per model
#

The question “which model is best?” is the wrong one. The right question is “which model is best for this specific task at this specific cost?” Build the observability that answers that.

4. Treat local as a capability, not a downgrade
#

Open-weight models are no longer a consolation prize. In many workflows they are the right tool — faster, cheaper, more private, more controllable. The teams still talking about them defensively are signaling how recently they last looked.

5. Design for hybrid as the default
#

The interesting AI architectures of 2026 will not be pure-frontier or pure-local. They will be orchestrated systems that blend a frontier model for the hard parts, open-weight models for the routine parts, and fine-tuned small models for the narrow, high-volume parts — each one called when, and only when, it earns its keep.

The Real AI Cost Lever of 2026
#

The dominant narrative this year will continue to focus on the frontier: bigger models, higher benchmarks, sharper capabilities. That narrative is real, and it matters.

But underneath it, there is a quieter shift that will determine which organizations actually build sustainable AI operations — and which ones end up rationalizing aggressive cost cuts in 2027.

The shift is not about choosing between frontier and local. It is about learning to use both, deliberately, at the right moments, in the right combinations.

The cheapest AI optimization available in 2026 is not a better deal from your current provider.

It is the decision to stop using a frontier model for work a local model can do just as well.

Intelligence is becoming abundant. Discernment is becoming the scarce resource.

The companies that will win the next phase of this cycle are not the ones paying the most per call.

They are the ones who have figured out which calls don’t need to be paid at all.

OpenSource on Carles Abarca

Stop Overpaying for Intelligence

The Overpaying Default #

What Local Models Can Actually Do Now #

The Map Most AI Architectures Are Missing #

It’s Not Just About Cost #

The Numbers Are Not Subtle #

What This Looks Like On My Own Desk #

So What Should Actually Change? #

1. Build a task taxonomy #

2. Start with a router, not a migration #

3. Measure cost and quality per task, not per model #

4. Treat local as a capability, not a downgrade #

5. Design for hybrid as the default #

The Real AI Cost Lever of 2026 #