
Mirror or Engine: What LLM Reasoning Actually Is

The logical thresholds that next-token prediction runs into

One of the most heated questions in AI right now is also one of the oldest: “Is the machine actually reasoning, or is it fluently imitating the outputs of reasoning?” Agentic AI systems that draft their own plans look like autonomous intelligence—but the algorithmic constraints running underneath raise a reasonable doubt: have we built an engine of intelligence, or merely a mirror that reflects the traces intelligence has left behind?

1. Next-Token Prediction and the Reversal Curse

The foundation of today’s LLMs is next-token prediction (NTP): maximizing statistical co-occurrence across massive text corpora. Whether that process encodes genuine logical structure is a different question—and the evidence increasingly invites skepticism. A striking example is the Reversal Curse.

Example 1: The asymmetry of family relations

Ask a model “Who is George Washington’s father?” and it answers “Augustine Washington” instantly. Ask the reverse—“Who is Augustine Washington’s son?”—and it either draws a blank or hallucinates a name. The same pattern shows up with Tom Cruise’s mother (Mary Lee Pfeiffer) and many other pairs.

If the model had truly internalized the logical relation “parent–child,” that relation should hold in both directions. Instead, NTP-based models memorize the directional pattern that appeared most often in training data. They recall a surface form; they do not infer the underlying structure.
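The asymmetry is easy to caricature. The sketch below is a toy context-counting predictor, nothing like a real transformer, but it shows how a pattern memorized in one direction offers no help in the other:

```python
from collections import defaultdict, Counter

# Toy next-token model: count which word follows each fixed-length context.
# Illustrative only; real LLMs learn representations rather than counting,
# but the directional memorization the Reversal Curse describes shows up
# even in this caricature.
corpus = [
    "george washington 's father is augustine washington",
    "augustine washington was a virginia planter",
]

CONTEXT = 3  # number of preceding tokens used as context

counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(CONTEXT, len(tokens)):
        counts[tuple(tokens[i - CONTEXT:i])][tokens[i]] += 1

def predict(context: str):
    """Return the most likely next token for a context, or None if unseen."""
    key = tuple(context.split()[-CONTEXT:])
    options = counts.get(key)
    return options.most_common(1)[0][0] if options else None

# Forward direction: the pattern appeared verbatim in the training data.
print(predict("washington 's father is"))  # -> 'augustine'
# Reverse direction: never seen, so there is nothing to fall back on.
print(predict("washington 's son is"))     # -> None
```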

2. The 60% Trap: How Errors Compound

The biggest bottleneck for agentic AI on complex tasks is not any single mistake—it is the accumulation of mistakes.

Example 2: The unreliable GPS

Imagine driving an unfamiliar route that requires 10 turns. Your GPS is excellent: 95% accurate at each intersection. That sounds impressive. But the probability of navigating all 10 turns correctly is 0.95^10 ≈ 0.60—a coin flip, roughly.

This is precisely what Planning Drift (Valmeekam et al., 2023) describes. A human driver who takes a wrong turn at step three recognizes the error and logically corrects course. A probabilistic model, by contrast, keeps generating “the next plausible token” even after a wrong turn—because there is no internal verification mechanism that checks the global coherence of the plan. It drifts, confidently, in the wrong direction.
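The arithmetic is worth making explicit. The few lines below use toy numbers matching the GPS example; real agent steps are not independent, but the compounding shape is the point:

```python
# Success probability of an n-step plan when each step is independently p-reliable.
def plan_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 95% each -> {plan_success(0.95, n):.0%} end-to-end")
# 1 step -> 95%, 5 steps -> 77%, 10 steps -> 60%, 20 steps -> 36%
```

Per-step reliability that sounds excellent in isolation erodes quickly once a plan gets long, which is why agent benchmarks reward short, verifiable steps.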

3. The Grounding Problem: A Bowling Ball on Cotton Candy

The classic cognitive-science puzzle of the Symbol Grounding Problem helps explain why LLMs make mistakes that feel almost absurd to humans.

Example 3: Missing physical common sense

Ask a model “What happens if you place a large bowling ball on a small piece of cotton candy?” and, guided by statistical proximity between the words, it might describe the bowling ball resting neatly on top of the cotton candy.

A person simulates this scenario instantly through what cognitive scientists call a World Model: bowling ball = heavy; cotton candy = fragile → cotton candy collapses. An LLM computes probabilistic distances between tokens. It has no access to the weight of a bowling ball or the structural weakness of spun sugar—those physical properties are never “grounded” in the symbols. As Yann LeCun argued in “A Path Towards Autonomous Machine Intelligence,” intelligence that lacks physical causality risks becoming a sophisticated game with unanchored symbols.
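To see what “grounding” buys, contrast token association with even a crude world model. The snippet below is a deliberate caricature with made-up numbers, not a proposal: it simply encodes mass and load capacity as explicit properties and lets a rule act on them.

```python
# A deliberately crude world model: objects carry physical properties,
# and a rule acts on those properties. All numbers are invented for illustration.
objects = {
    "bowling_ball": {"mass_kg": 6.0},
    "cotton_candy": {"max_load_kg": 0.05},  # collapses under anything heavier
}

def place_on(top: str, bottom: str) -> str:
    """Predict the outcome of stacking `top` on `bottom` from grounded properties."""
    if objects[top]["mass_kg"] > objects[bottom]["max_load_kg"]:
        return f"the {bottom} collapses under the {top}"
    return f"the {top} rests on the {bottom}"

print(place_on("bowling_ball", "cotton_candy"))
# -> 'the cotton_candy collapses under the bowling_ball'
```

The answer falls out of the properties, not of how often the two nouns co-occur in text; that is the gap the Symbol Grounding Problem names.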

4. JEPA: Learning Structure, Not Surface

JEPA (Joint-Embedding Predictive Architecture) is one of the most discussed alternatives to generative NTP. Instead of learning to reproduce every detail of the data, JEPA learns to predict the abstract relational structure beneath it.

Example 4: Drawing a bicycle vs. riding one

A generative model trying to reconstruct every pixel of a bicycle might end up drawing three wheels. JEPA instead learns “pedaling turns the chain, which rotates the wheel”—the causal relationship between components.

JEPA discards fine-grained noise and focuses on predicting meaningful state transitions in a system. This resonates with good software-engineering practice: robust systems are designed around interfaces and object relationships, not the implementation details inside each component. Treating intelligence as “comprehension of structural relationships” rather than “a sum of probabilities” is the animating hypothesis behind this line of research—and it may be the clearest path over the wall that NTP has run into.
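In code terms, the core idea is that the loss is computed between embeddings, not between raw observations. The sketch below is a minimal PyTorch rendering of that objective using assumed, generic encoder and predictor modules; it is not Meta’s I-JEPA implementation, only the shape of the training signal.

```python
import torch
import torch.nn as nn

DIM = 128

# Generic stand-in encoders and predictor (assumptions, not the published architecture).
context_encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, DIM))
target_encoder  = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, DIM))
predictor       = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))

def jepa_loss(context_view: torch.Tensor, target_view: torch.Tensor) -> torch.Tensor:
    z_context = context_encoder(context_view)   # abstract state of the visible part
    with torch.no_grad():                        # target encoder is not trained by this
        z_target = target_encoder(target_view)  # loss (an EMA copy in practice)
    z_pred = predictor(z_context)                # predict the target's embedding
    return nn.functional.mse_loss(z_pred, z_target)  # compare in latent space, not pixel space

# Toy usage: two "views" of the same underlying scene (random stand-ins here).
x_context = torch.randn(32, 784)
x_target  = torch.randn(32, 784)
print(jepa_loss(x_context, x_target).item())
```

Because the model is never asked to reconstruct every pixel, the encoders are free to discard noise and keep only the state that is predictively useful—the “structure, not surface” point above.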

Closing: The Streetlight Effect in AI

We may be living through what psychologists call the Streetlight Effect: searching for lost keys under the lamppost because that is where the light is, not because that is where the keys are. A language model producing fluent sentences does not prove there is a reasoning engine inside—it proves there is a very bright lamp.

From an engineering standpoint, current AI is a superb intuition engine and a questionable logic engine. The field has spent years improving the resolution of probability through scaling laws. The harder, and more important, question is architectural: how do we ground intelligence in the causal structure of the real world? That is the frontier that scaling alone cannot reach.