The hottest topic in AI right now is, without question, agentic AI. Watching a model plan its own steps, call tools, and chain together multi-step tasks naturally raises a question: is this model actually reasoning, or is it just fluently imitating the outputs of reasoning?
I think this is a question worth taking seriously. The autonomy of these agents evokes the image of a “reasoning machine,” but the algorithm running underneath is still next-token prediction (NTP) — a probabilistic guess at what word comes next. In this post I want to sort out my own thinking on whether LLMs will, eventually, be capable of reasoning.
What Is Reasoning?
Before deciding whether LLMs reason, we need to be clear about what the word reasoning actually points to. In casual conversation we tend to call any “smart-sounding” answer reasoning, but in cognitive science and logic it has a fairly precise meaning. I think the cleanest starting point is to distinguish three things.
First, recall. This is the retrieval of facts that already exist in the training data. Answering “Augustine Washington” to “Who is George Washington’s father?” is recall. No new fact is produced.
Second, pattern matching. This is producing the most likely output for an input whose surface form resembles something in the training data. NTP excels at this. It produces plausible answers when the surface form is familiar, but breaks down when the surface is perturbed even slightly.
Third, reasoning itself. Reasoning is the ability to derive an unstated fact from facts and rules already in hand. If you can take “A’s father is B” and “father–son is a reciprocal relation” and derive “B’s son is A,” that is reasoning. Deduction, induction, and analogy all sit inside this category.
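The difference between recall and reasoning can be made concrete with a toy sketch. Everything here is hypothetical and for illustration only: the `FATHER_OF` table, the function names, and the choice to derive "child" rather than "son" (the toy table stores no gender). Recall is a lookup of a stored fact; reasoning applies a rule to stored facts to produce one that was never stated.

```python
# Hypothetical toy knowledge base: only the "father" direction is stored.
FATHER_OF = {"George Washington": "Augustine Washington"}

# Nothing was ever stored in the reverse direction.
CHILD_OF = {}


def recall_father(person):
    """Recall: return a stored fact, or None if it was never stored."""
    return FATHER_OF.get(person)


def recall_child(parent):
    """Recall in the reverse direction fails because no such fact exists."""
    return CHILD_OF.get(parent)


def derive_children(parent):
    """Reasoning: apply the rule 'if B is A's father, then A is B's child'
    to the stored facts, producing facts that were never stated."""
    return [child for child, father in FATHER_OF.items() if father == parent]


print(recall_father("George Washington"))
print(recall_child("Augustine Washington"))
print(derive_children("Augustine Washington"))
```

The lookup succeeds only in the direction that was stored, while the rule-based function recovers the reverse fact from the same data.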
These three are easy to confuse. Recall and pattern matching can produce outputs that look like reasoning, especially on familiar inputs. But they are different operations underneath, and that difference is the key to understanding the limits of current LLMs.
What Does It Mean to Reason Well?
So what would it mean for a system to reason well? I think four properties matter most.
First, robustness to surface change. Asking the same question in different words, swapping names, or rephrasing the problem should not change the conclusion. Reasoning operates on the structure beneath the surface, not on the surface itself. Apple’s GSM-Symbolic study (Mirzadeh et al., 2024), which found that LLM accuracy on grade-school math problems declines when only the names and numbers are changed, and falls much further when an irrelevant clause is added, is a signal that current models are weak on this axis.
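One way to test this property is to turn a problem into a template and score a solver across many surface variants. The sketch below is in the spirit of that idea, not a reproduction of any published benchmark: the template, the names, and both stand-in solvers are assumptions made up for illustration.

```python
import random

# Hypothetical template: the name and the two numbers are slots.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"


def make_variant(rng):
    """Fill the template at random; return (question, true answer)."""
    name = rng.choice(["Ava", "Noah", "Mia", "Liam"])
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(name=name, a=a, b=b), a + b


def accuracy_under_perturbation(solver, n_variants=100, seed=0):
    """Fraction of surface variants that the solver answers correctly."""
    rng = random.Random(seed)
    variants = [make_variant(rng) for _ in range(n_variants)]
    return sum(solver(q) == answer for q, answer in variants) / n_variants


def memorizing_solver(question):
    """Stand-in for a pattern matcher: it knows one memorized instance only."""
    memorized = {TEMPLATE.format(name="Ava", a=3, b=4): 7}
    return memorized.get(question)


def structural_solver(question):
    """Stand-in for a reasoner: extracts the two quantities and adds them."""
    return sum(int(tok) for tok in question.split() if tok.isdigit())


print(accuracy_under_perturbation(memorizing_solver))
print(accuracy_under_perturbation(structural_solver))
```

The structural solver is unaffected by which name or numbers fill the slots, while the memorizing solver is right only if a variant happens to match the one instance it has seen.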
Second, the ability to handle combinations the model has never seen before. Real reasoning means applying known rules to new situations rather than recalling stored cases. This is usually called compositional generalization.
Third, stability under depth. If a one-step inference is correct, the next step should be able to build on it, and the conclusion should not collapse simply because the chain got longer.
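A back-of-the-envelope calculation shows why depth is demanding. The sketch assumes, purely for illustration, that each step is correct with the same probability and that steps fail independently. Neither assumption comes from a measurement, and the 0.95 figure is made up.

```python
def chain_accuracy(per_step_accuracy, depth):
    """Probability that every step in a chain is correct,
    ASSUMING steps fail independently (an illustrative simplification)."""
    return per_step_accuracy ** depth


for depth in (1, 5, 10, 20):
    print(depth, round(chain_accuracy(0.95, depth), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Under these assumptions, a system that is right 95% of the time per step completes a 20-step chain correctly only about a third of the time, so stability under depth requires per-step reliability far higher than a single-step benchmark suggests.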
Fourth, knowing what one knows. Genuine reasoning distinguishes between conclusions that can and cannot be derived from the available premises, and stays silent — or admits ignorance — about the latter.
With these four criteria in place, the otherwise hand-wavy question “does the LLM reason?” becomes something we can actually test.
Where Do Today’s LLMs Stand on Reasoning Benchmarks?
So how do current LLMs measure up against these criteria? Helpfully, the field has built a wide range of benchmarks aimed at reasoning, and they let us compare progress with at least some objectivity.
A few of the most-cited ones:
- MMLU: undergraduate-level multiple choice across many subjects. Already saturated near 90%.
- GPQA Diamond: PhD-level science questions written by domain experts. Top models in 2025–2026 reach the 80% range.
- MATH and competition math benchmarks like AIME: with the rise of reasoning-tuned models (o1/o3, Claude Opus 4.x, Gemini 3), accuracy has climbed quickly into the 80–90% range.
- ARC-AGI: François Chollet’s abstract pattern-reasoning benchmark. OpenAI o3 hit the 80s on v1 in high-compute mode, but on the harder v2 even top models sit in the single digits to low teens.
- HLE (Humanity’s Last Exam): released in early 2025, drawing graduate-level questions across a broad range of expert disciplines. Frontier models started in the low teens at launch and, about a year later, remain in the high 30% range — still well below human-expert performance.
- SWE-bench: solving real GitHub issues, a test of practical coding reasoning. Top agents only recently crossed into the 70% range.
The picture from these benchmarks is two-sided. On one hand, problem types that were genuinely hard a few years ago (undergraduate-level question answering, contest math, expert science) are saturating, with top models approaching human-expert performance. On the other hand, surface-perturbed problems like GSM-Symbolic, novel-combination problems like ARC-AGI v2, and breadth-and-depth problems like HLE still cause the same models to fall apart.
Put another way, today’s LLMs perform impressively where the surface looks like the training data, but they have yet to clear the bar on any of the four criteria above: surface robustness, compositional generalization, depth stability, and self-awareness. Rising benchmark scores do not automatically mean the underlying capability is being acquired. The next two sections look at where this gap actually comes from.
LLMs Memorize Patterns, Not Relations
The best-known piece of evidence that NTP is not learning reasoning is the Reversal Curse (Berglund et al., 2023).
Ask a model “Who is George Washington’s father?” and it answers “Augustine Washington” instantly. Ask the reverse — “Who is Augustine Washington’s son?” — and it either draws a blank or names someone else entirely. Berglund and colleagues showed the same pattern with Tom Cruise’s mother, Mary Lee Pfeiffer: a model that has clearly memorized the relation in one direction fails on the question “Who is Mary Lee Pfeiffer’s son?”
If the model had genuinely internalized the relation “parent–child,” it would hold in both directions: if A’s father is B, then B’s son is A. NTP-based models, however, memorize the order in which a sentence frequently appeared in their training data, without inferring the underlying logic. They treat two surface phrasings of the same fact as two different facts.
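A deliberately tiny next-token model makes the directional nature of this storage visible. This is a counting-based toy (a fixed three-token context and greedy decoding, both chosen for illustration), not a claim about how any real LLM works internally. Real models generalize far more than this, but the training signal has the same left-to-right shape.

```python
from collections import Counter, defaultdict


def train(sentences, context_size=3):
    """Count which token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(context_size, len(tokens)):
            counts[tuple(tokens[i - context_size:i])][tokens[i]] += 1
    return counts


def complete(counts, prompt, max_new_tokens=2, context_size=3):
    """Greedy decoding: repeatedly append the most frequent next token."""
    tokens = prompt.split()
    generated = []
    for _ in range(max_new_tokens):
        context = tuple(tokens[-context_size:])
        if context not in counts:
            break  # the model has never seen this context
        next_token = counts[context].most_common(1)[0][0]
        tokens.append(next_token)
        generated.append(next_token)
    return " ".join(generated)


corpus = ["George Washington's father is Augustine Washington"]
model = train(corpus)

forward = complete(model, "George Washington's father is")
reverse = complete(model, "Augustine Washington's son is")
print(repr(forward))  # 'Augustine Washington'
print(repr(reverse))  # ''
```

The forward prompt is completed because its word order matches the training sentence. The reverse prompt yields nothing, even though the fact it asks about follows logically from the one sentence the model saw.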
Measured against the definition of reasoning above, this is the simplest possible deduction — a one-step inversion — and the model cannot perform it. The model can imitate the outputs of reasoning, but it does not perform the operation. The fact that benchmark scores hover at 80–90% does not change the picture: those scores measure something, but the simultaneous existence of failures like this suggests that what they measure does not line up cleanly with reasoning as we have defined it.
The Distance Between Word and World: The Symbol Grounding Problem
The Reversal Curse is one face of a deeper issue. The classic cognitive-science puzzle Stevan Harnad raised in 1990 — the Symbol Grounding Problem — explains why these mistakes recur.
The core idea is this: a model that uses the words “father” and “son” correctly is not necessarily a model that understands what parents and children are. To the model, a word is a statistical pattern that co-occurs with other words. It is not anything in the world.
Humans can reason because we run simulations against a world model that sits behind our words. The word “father” pulls in a rich web of causal facts (a parent precedes their child in time; parent and child mutually entail each other; if X is Y’s father, then Y is X’s child), so “Who is Augustine Washington’s son?” does not cause even a moment’s hesitation. To an LLM, “father” is a vector whose meaning is defined only by its position relative to other token vectors.
As Yann LeCun argues in A Path Towards Autonomous Machine Intelligence, intelligence stripped of causal structure risks becoming a sophisticated game played with unanchored symbols. Arranging words well is one capability; understanding the world the words point to is another. NTP is essentially optimized for the first.
A Step Beyond NTP: JEPA
So how might we get past this wall? One of the most discussed alternatives right now is JEPA (Joint-Embedding Predictive Architecture).
The idea behind JEPA is this: instead of trying to generate every detail of the data, predict in an abstract representation space. Where NTP is trained to reproduce the exact next token, JEPA encodes both the context and the target into embeddings and predicts the target’s embedding, so unpredictable surface detail can be discarded and what remains is the system’s core states and how they transition.
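The contrast between the two objectives can be sketched in a few lines. This is a hand-built toy, not an implementation of any published JEPA model: the encoders and predictor are fixed matrices chosen for illustration, whereas in a real system they are learned networks and extra machinery is needed to keep the embeddings from collapsing.

```python
import math


def ntp_loss(logits, target_index):
    """Next-token prediction: cross-entropy against the exact next token."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target_index]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def jepa_loss(context, target, context_encoder, target_encoder, predictor):
    """JEPA-style objective: squared error between the predicted embedding
    and the target's embedding, not the target itself."""
    predicted = matvec(predictor, matvec(context_encoder, context))
    target_embedding = matvec(target_encoder, target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target_embedding))


# Hypothetical encoder that keeps two "state" coordinates and drops a third
# "detail" coordinate; the predictor is the identity.
encoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
predictor = [[1.0, 0.0], [0.0, 1.0]]

context = [1.0, 2.0, 0.3]
target_a = [1.0, 2.0, 0.9]   # same state, different detail
target_b = [1.0, 2.0, -5.0]  # same state, very different detail

print(jepa_loss(context, target_a, encoder, encoder, predictor))
print(jepa_loss(context, target_b, encoder, encoder, predictor))
print(ntp_loss([2.0, 0.0, 0.0], target_index=0))
```

Both targets give the same embedding-space loss because they differ only in the coordinate the encoder discards, whereas the token-level loss is always scored against one exact output.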
I find this idea structurally similar to abstraction in software engineering. Good code does not become robust by memorizing every implementation detail; it becomes robust by defining clean interfaces and relations between objects. By the same logic, a good intelligence model is not one that memorizes every pixel and word, but one that learns the structure beneath them. JEPA is an attempt to redefine intelligence as the comprehension of structural relationships rather than the sum of probabilities.
It is too early to call JEPA the answer. But the more clearly we see the wall NTP has hit — failure to infer relations, failure to ground words in the world — the clearer it gets that climbing past it is unlikely to be a matter of simply scaling further. The next step probably lies in a different learning objective, not in more of the same.
Closing: Intuition Engine vs. Logic Engine
Let me return to the original question: will LLMs eventually be capable of reasoning?
I suspect we are in the middle of a kind of streetlight effect. The fact that the area beneath the lamppost is brightly lit does not mean the lost keys are there. The fact that an LLM produces fluent text does not prove a reasoning engine is inside it. We may simply be searching where it is easy to measure — token-prediction accuracy — rather than where the thing we want actually lives.
From an engineer’s perspective, today’s LLMs are unmistakably excellent intuition engines. The ability to produce a plausible next word over a vast space of patterns is impressive in its own right and enables a great deal of useful work. The rapid climb in benchmark scores reflects steadily increasing resolution of that intuition. But calling them logic engines would be premature: the Reversal Curse, the Symbol Grounding Problem, and the cracks visible in ARC-AGI v2 and GSM-Symbolic all suggest there is much left to verify. None of the four criteria laid out above — surface robustness, compositional generalization, depth stability, self-awareness — has been fully met yet.
My tentative answer to whether LLMs will eventually reason is this: not by sharpening the resolution of NTP alone. The next step is an architectural question — how do we ground intelligence in the causal structure of the world? A machine that genuinely reasons will not be one that strings words together well, but one that can simulate the world the words point to.