The Cost Of Intelligence Isn’t Going Up, Your Expectations Are
Did you know that GPT-5.5 is better than GPT-5.4? Sorry, let me rephrase that: Did you know GPT-5.5 with medium reasoning is a better model and uses far fewer tokens to do a task than the maxed-out GPT-5.4 xhigh?

The dots on each line are GPT-5.4 and GPT-5.5 solving a complex task with reasoning effort set to none, low, medium, high, and xhigh. As you can see, the newer model scores higher on the benchmark while using fewer tokens.
When people look at benchmarks they tend to make sweeping judgments based on a single number. SWE-Bench went up 2%, so the model’s a flop. GDPval jumped 20%, so jobs are cooked. But that’s not how people or organizations should think about model improvements.
The pattern I described with GPT-5.4 and GPT-5.5 has repeated with every new model for over three years now. Extrapolate that out and the amount you pay for premium intelligence on demand keeps going down, not up. Sounds like a great deal to me.
Step Into A Time Machine
I recently had to use GPT-4.1 for a task and was reminded that the state-of-the-art model I was using a year ago is practically dumb as rocks now. The amount of intelligence at my disposal has gone up, and so have my expectations for what I can do with it. That’s why I don’t agree when people say there hasn’t been much progress at the bleeding edge, or that these models have gotten dramatically more expensive to use.
To prove the point, try these two experiments:
- Take a recent task you solved with a modern model like GPT-5.5 or Opus 4.8 — preferably something fairly complex — and hand it to an older model like Claude Sonnet 4. You’ll watch a year-old state-of-the-art model struggle, because what you consider a complex task has gotten more ambitious too.
- Instead of running a task with xhigh reasoning, try low or medium and see if it works just as well. It probably will unless you’re solving a tough coding challenge — and you’ll get the same result faster and cheaper.
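If you want to run the second experiment a bit more systematically, here is a minimal sketch in Python. It is not tied to any provider's SDK: `run_task` is a hypothetical callable you would write yourself to call your model at a given effort level, check the output against your own acceptance test, and report the tokens used. The effort labels (with "none" standing in for no reasoning) and the numbers in the stand-in runner are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Effort levels as named in the post, ordered cheapest first.
# "none" is an assumed label for running with no reasoning.
EFFORTS = ("none", "low", "medium", "high", "xhigh")


@dataclass
class RunResult:
    effort: str
    passed: bool       # did the output meet your own acceptance check?
    tokens_used: int   # total tokens consumed by the run


def cheapest_passing_effort(
    run_task: Callable[[str], RunResult],
    efforts: Sequence[str] = EFFORTS,
) -> Optional[RunResult]:
    """Try each effort from lowest to highest and return the first passing run."""
    for effort in efforts:
        result = run_task(effort)
        if result.passed:
            return result
    return None  # nothing passed, even at the highest effort


# Hypothetical stand-in for a real model call; the numbers are invented.
def fake_run_task(effort: str) -> RunResult:
    made_up_tokens = {
        "none": 500, "low": 2_000, "medium": 6_000,
        "high": 15_000, "xhigh": 40_000,
    }
    passed = effort in ("medium", "high", "xhigh")
    return RunResult(effort, passed, made_up_tokens[effort])


best = cheapest_passing_effort(fake_run_task)
print(best)
```

Swap `fake_run_task` for a function that actually calls your model and the loop stops at the lowest effort that gets the job done, which is the setting worth defaulting to for that kind of task.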
The best benchmark we have isn’t any single number — it’s the passage of time.