Not Every Prompt Needs GPT-4: Smart Model Routing With a PHP Classifier

If you are building anything that calls an LLM from a Laravel application, you have probably noticed the same thing I did: your bill is dominated by prompts that did not need a frontier model to answer. A user types "what are your business hours?" and it hits GPT-4. A support bot summarizes a log file through Claude Opus. A form validator fires o1 to check if an email looks valid. You are paying premium inference prices for questions a 4o-mini or Haiku could handle in half the time and a tenth of the cost.

The problem is not that cheap models exist. They do, and they are good at plenty of things. The problem is that your application has no way to know, before opening a socket to a provider, whether a given prompt is going to be easy or hard. So you either route everything to the expensive model and overpay by default, or you build a routing heuristic that breaks the first time a user phrases something differently. Neither is great, and both get worse the more requests you handle.

How we fake difficulty today (and why it fails)

If you go looking, the workarounds all share the same shape. Some teams route by prompt length, on the theory that longer means harder. In practice, a one-line question about Italian contract law is short and genuinely difficult, while a two-page pasted server log that you want summarized is trivial for any model. Length measures typing, not difficulty.

Others maintain a keyword dictionary: anything containing "legal," "code," or "calculate" goes to the premium tier. This works for about a week. Then you are maintaining a dictionary forever. It misses every phrasing you did not anticipate. It has no opinion about prompts in a language you did not hard-code.

The most honest attempt is to ask an LLM to rate the difficulty of the prompt before answering it. The problem is that you are now paying for a model call, and waiting for it, in order to decide whether to make a model call. You have added latency and cost to the exact path you were trying to make cheaper.

A score that comes from your own models

The right question is not "is this prompt hard in the abstract?" It is "do the models I actually use find this prompt hard?" Those are different questions, and the second one is the only one that matters when you are deciding which of your providers should answer.

The new package neuron-core/llm-classifier builds a small classifier that reads an incoming prompt and returns a difficulty score between 0 and 1, where 0 means your models find this easy and 1 means they struggle. The important word there is your. The score is learned from the models you actually route between -- GPT-4o-mini, GPT-4o, o1, Claude Haiku, Claude Opus -- so the classifier's idea of "hard" matches your infrastructure's actual behavior.

And it runs in pure PHP. The only extension required is ext-mbstring. No Python sidecar, no GPU, no inference server. Training happens once, offline. Scoring runs in microseconds, in-process, before you ever open a socket to a provider.

1composer require neuron-core/llm-classifier

Two phases, kept strictly apart

Calibration is the offline phase. You give the calibrator a panel of your models, a corpus of example prompts, and a way to grade whether each model answered correctly. It works out which patterns trip them up and writes a single model.bin file. You commit it alongside your code. When your model lineup changes, re-run calibration and replace the binary.

Scoring is the runtime phase. You load model.bin once -- on app boot, or inside Octane, RoadRunner, or FrankenPHP workers -- then call it on each request. Train time and run time never touch.

The package ships with a ready-to-use dataset derived from RouterBench: ~36,000 prompts evaluated against 11 common LLMs. A stratified sample of ~1,845 prompts with precomputed difficulty labels is included. Download the fastText vectors, run the calibration script, and you have a working model in under two minutes.

How the scoring actually works

The package uses a free word vector dictionary from fastText -- a 300-dimensional embedding space where "buy" and "purchase" land close together and "king" and "carburetor" land far apart. Each incoming prompt is reduced to one averaged fingerprint of those vectors, and that fingerprint is the classifier's only input.

The overall() method returns the maximum across per-capability scores, not the average. A prompt that is hard at one thing and trivial at five others should be treated as hard. An average would quietly water that down.

Plugging it into the Neuron AI router

 1$scorer = Classifier::load('storage/model.bin');
 2 
 3return RouterProvider::make()
 4    ->addProvider('mini', new OpenAI(model: 'gpt-4o-mini'))
 5    ->addProvider('4o',   new OpenAI(model: 'gpt-4o'))
 6    ->addProvider('o1',   new OpenAI(model: 'o1'))
 7    ->setRule(
 8        (new DifficultyRule($scorer))
 9            ->outOfDomain('o1', coverage: 0.4)
10            ->easy('mini',  maxScore: 0.33)
11            ->medium('4o',  maxScore: 0.70)
12            ->hard('o1')
13    );

That is the entire integration. "What are your opening hours?" lands on the cheap model; "draft an NDA under Italian law" lands on the premium one -- automatically.

Two knobs, and how to turn them

Two things to tune, and you tune with data not intuition. Difficulty cut-offs (0.33, 0.70) decide where easy ends and hard begins. Coverage cut-off (0.4) decides how unfamiliar a prompt has to be before you escalate to the most capable model.

Log three things for real traffic: difficulty score, coverage value, and chosen provider. Adjust until the balance is right. Cheap answers coming back wrong? Lower the hard threshold. Out-of-domain prompts leaking to cheap tier? Raise the coverage cut-off.

Why this matters for PHP/Laravel shops

Most routing tools assume a Python service somewhere in your stack -- another process to monitor, another thing that can fail at 3 AM. For a Laravel team with a clean PHP pipeline and small ops surface, that is a hard sell.

The llm-classifier removes that friction entirely. It is a Composer package. It runs in the same process. Works under Octane, RoadRunner, FrankenPHP with no special config. No new infrastructure, no new runtime dependencies.

If 60% of your LLM traffic is cheap-model material and you were routing everything to GPT-4o, the classifier pays for itself in days, not months.

The takeaway

For a long time the answer to "which model should answer this?" in PHP was either static or a pile of keyword matching you maintained by hand. Now there is a measured answer that costs microseconds, comes from your own models, and drops into a router that was already part of the framework.

MIT licensed: github.com/neuron-core/llm-classifier