
The AI industry is obsessed with throughput. How fast can a model generate tokens? How cheaply? How many can fit in a context window? These are the questions animating billions in investment and thousands of research papers. They’re also the wrong questions.
Token generation was never the hard part. Thinking is.
Einstein, when asked how he arrived at the theory of relativity, was clear about his process: “The words or the language, as they are written or spoken, do not seem to play any role in my mechanism of thought.” He reasoned in images, muscular sensations, spatial relationships. Words came last — a translation layer between thought and communication, not the medium of thought itself.
Now consider how large language models work. They’re trained to predict the next token. Every “thought” a model has — every inference, every logical step, every piece of reasoning — happens through the act of producing language. There is no latent deliberation that precedes the output. The output is the deliberation. Thinking and speaking are architecturally fused.
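To make that fusion concrete, here is a minimal greedy decoding loop, assuming the Hugging Face transformers library and the small gpt2 checkpoint. It is a sketch of the mechanism, not any particular production system:

```python
# Minimal greedy decoding loop, assuming the Hugging Face `transformers`
# library and the small `gpt2` checkpoint. The point of the sketch: each
# "reasoning step" IS the emission of a token. There is no separate
# deliberation phase; the output and the thinking are the same loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits   # one forward pass
        next_id = logits[0, -1].argmax()   # the model's entire "thought" at
                                           # this step: a single token choice
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Every pass through that loop is both the model’s next increment of reasoning and its next increment of speech; there is nowhere in the control flow for one to happen without the other.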
This is not a minor implementation detail. It’s a fundamental constraint.
Chain-of-thought is a workaround, not a solution
When researchers discovered that asking models to “think step by step” improved performance, it felt like a breakthrough. And it was — within the existing architecture. But chain-of-thought prompting is essentially a hack: we’re forcing the model to externalise its reasoning in natural language because there’s nowhere else for reasoning to happen.
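Here is what the hack looks like at the prompt level. The complete function below is a hypothetical placeholder for a call to any autoregressive LLM, not a real API:

```python
def complete(prompt: str) -> str:
    """Hypothetical placeholder for a call to any autoregressive LLM."""
    raise NotImplementedError

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 "
    "more than the ball. How much does the ball cost?"
)

# Direct answering: the model must commit to an answer in its first tokens.
direct = complete(question + "\nAnswer:")

# Chain-of-thought: the model is told to externalise intermediate steps,
# because emitted tokens are the only place its reasoning can happen.
step_by_step = complete(question + "\nLet's think step by step.")
```

Nothing about the model changes between the two calls. The extra capability is literally extra text.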
The model isn’t thinking and then writing. It’s writing as a way of thinking. Those are very different things.
A human mathematician solving a hard problem doesn’t narrate their reasoning aloud in real time. They work in their head, in symbols, in intuitions that resist easy verbalisation. Only when they reach a conclusion do they translate it into language. The narration follows the thought. In current LLMs, it is the thought.
Chain-of-thought prompting is the equivalent of asking Einstein to figure out relativity by saying each word out loud as it occurred to him. It would not have worked, because he would have been constrained by the structure of language itself, not the structure of the problem. And the loss would have been concrete: no GPS, which depends on relativistic corrections to atomic clocks to give you accurate directions, and no nuclear energy, which exists because E=mc² describes how a small amount of mass becomes an enormous amount of energy.
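For scale, here are the standard back-of-the-envelope numbers behind that GPS claim (textbook figures, not from the original argument):

```latex
% Gravity speeds GPS satellite clocks up by roughly 45 microseconds per day;
% orbital velocity slows them by roughly 7. Left uncorrected, the net drift
% converts into kilometres of ranging error every single day.
\[
\Delta t \approx +45\,\mu\mathrm{s/day} - 7\,\mu\mathrm{s/day}
         \approx +38\,\mu\mathrm{s/day},
\qquad
c\,\Delta t \approx (3\times10^{8}\,\mathrm{m/s})(38\times10^{-6}\,\mathrm{s})
            \approx 11\,\mathrm{km\ of\ error\ per\ day}.
\]
```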
What we’ve actually optimised
The last three years of AI progress have been remarkable. But when you look at where the investment has gone, a pattern emerges: we’ve optimised the output layer.
Faster inference. Cheaper tokens. Longer context windows. More efficient quantisation. Better serving infrastructure. All of these make it faster and cheaper to generate tokens. None of them make the reasoning better. They make the same not-actually-reasoning happen faster.
This is the expensive mistake: confusing throughput with intelligence. You can generate a million tokens of mediocrity per second and have nothing worth reading. You can generate ten tokens of genuine insight and have changed someone’s mind. The industry’s benchmark obsession — MMLU, HumanEval, GPQA — measures what models produce in language, not the quality of the reasoning that led to those outputs.
Goodhart’s Law, applied to AI: when a measure becomes a target, it ceases to be a good measure. We’ve made fluent token output the implicit target, optimised for it relentlessly, and are now surprised that models still hallucinate, still fail at multi-step reasoning, still make mistakes a careful human wouldn’t.
The architecture most people aren’t discussing
There’s a research direction that deserves more mainstream attention: latent space reasoning.
Some newer approaches attempt to decouple reasoning from language generation. Instead of forcing a model to think in words, they allow processing to occur in the model’s internal representational space — the high-dimensional latent space where semantic relationships live — before surfacing language at the end. The model deliberates in embeddings, not tokens.
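An illustrative sketch of the idea, loosely in the spirit of continuous-thought approaches such as Meta’s Coconut, and assuming the same transformers library and gpt2 checkpoint as above. A real system would be trained for this behaviour; an off-the-shelf GPT-2 will not produce useful latent “thoughts”:

```python
# Latent-space deliberation, sketched: instead of sampling a token at each
# step, feed the model's last hidden state back in as the next input
# embedding. Intermediate "thoughts" never pass through the vocabulary;
# language is only produced at the end. Illustrative, not a working method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt_ids = tokenizer("2 + 2 =", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(prompt_ids)    # (1, seq, hidden)

with torch.no_grad():
    # Latent deliberation: several steps in which NO token is emitted.
    for _ in range(4):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]   # last hidden state
        embeds = torch.cat([embeds, thought], dim=1) # append as next input

    # Only now surface language: decode a single token from the final state.
    next_id = out.logits[0, -1].argmax().item()

print(tokenizer.decode([next_id]))
```

The design point is visible in the loop: the intermediate states never touch the tokenizer, so deliberation runs at the resolution of the embedding space rather than the resolution of the vocabulary.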
This is closer to how Einstein described his own cognition. Reasoning first. Translation second.
Early work in this direction — diffusion-based language models, models that iterate over internal representations before generating output — suggests this isn’t purely theoretical. But it’s not where the money is. The money is in making transformer-based token predictors faster and cheaper.
The contrarian question is whether that’s the right bet.
The bottleneck hiding in plain sight
Everyone is arguing about AI ethics, job displacement, existential risk. These are important debates. But they largely take the current architecture as given, as if the way we’re building AI now were fixed and all that remained were to decide how to use it.
There’s a more fundamental, structural question: are we building systems that reason in the wrong medium?
Language is a lossy compression of thought. It’s a protocol humans developed to share ideas between minds, not to have ideas within one. When we train models to think exclusively in that protocol, we’re not just accepting a constraint — we’re enshrining it. We’re building systems that can only reason at the speed and resolution of natural language.
That might be fast enough. For many applications — AI in education among them — a model that thinks in tokens and produces fluent text is genuinely useful. But if we want systems that reason about hard problems the way hard problems actually require — in abstract, pre-linguistic space — then generating tokens faster isn’t the answer.
We’ve optimised brilliantly for the wrong bottleneck.