
Thinking Deeper, Not Just Wider: A Beginner's Guide to the Hierarchical Reasoning Model

An introduction to the Hierarchical Reasoning Model (HRM), a new, brain-inspired AI architecture. This guide explains why standard LLMs struggle with complex reasoning and how HRM's unique structure enables it to "think deeper," solve complex problems, and learn more efficiently. Perfect for students and AI enthusiasts.
Aug 2, 2025

If you've ever tried to get a Large Language Model (LLM) like ChatGPT or Gemini to solve a complex Sudoku puzzle or navigate a large maze, you might have noticed that it stumbles.

It can talk about the rules, but when it comes to actually performing many steps of logical reasoning, it often fails. Why is that? Let's look at recent work that tries to solve exactly this problem.

A recent paper, "Hierarchical Reasoning Model" (HRM), introduces a fascinating, brain-inspired model that tackles this very problem. It's a significant step towards AIs that can genuinely reason through complex tasks, not just match patterns.

Let's break down the core ideas, focusing on the "why" and "how" for students of AI.

The Glass Ceiling of Standard LLMs: Fixed Computational Depth

The first thing to understand is a fundamental limitation of standard Transformers (the architecture behind most LLMs). They have a fixed number of layers. Whether you give it a simple question or a multi-step logic puzzle, the data passes through the same number of processing layers.

Think of it like a calculator that only allows you to press 12 buttons before it spits out an answer. For 2+2, that's fine. But for a long, complex calculation, it's impossible.

This "fixed depth" has a major consequence: it means standard AI models are fundamentally limited in the types of problems they can solve.

  • They are not "Turing-complete." This is a key concept in computer science. A system is Turing-complete if, given enough time and memory, it can solve any computable problem, just like your laptop can run any software. In other words, it can follow an arbitrarily long list of instructions, i.e. an algorithm.
  • LLMs can't do this. In a single forward pass, an LLM can't follow a long, step-by-step recipe. It's more like a super-powerful calculator than a general-purpose computer. It's great at quick calculations and pattern matching but fails at tasks that need deep, sequential reasoning.

This is why new architectures are needed. We don't just need wider models; we need deeper thinkers.

HRM's Solution: A Brain-Inspired Hierarchy

The HRM tackles the depth problem not by stacking more layers, but by introducing a clever, recurrent hierarchy inspired by the brain's own organisation. It consists of two main parts that work together:

  1. The Low-Level Module (L): This is the "fast thinker" or System 1. It's a recurrent neural network (RNN). The term "recurrent" is key—it means the module has connections that loop back on themselves. This recurrent connectivity allows it to maintain a memory and refine its calculations over multiple internal steps, converging on a local solution.
  2. The High-Level Module (H): This is the "slow thinker" or System 2. It's also a recurrent module that takes the summarized output from the L-module, integrates it into a global plan, and provides top-down guidance for the L-module's next round of computation.

A great way to think about this is like a manager and their intern working on a complex project. The manager (the High-Level module) gives the intern (the Low-Level module) a specific sub-task, like "research these data points." The intern goes away, works intensely on that one task for a while, and comes back with a summary of their findings. The manager takes this summary, integrates it into the overall project plan, and then gives the intern a new, refined task based on the project's current status, like "Okay, based on that, now draft this section."

This process creates a nested loop: the L-module iterates many times within a single update of the H-module. This allows the HRM to achieve immense effective computational depth in a single forward pass, breaking free from the fixed-depth limitation of standard Transformers.
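To make this concrete, here is a minimal PyTorch-style sketch of that nested loop. Everything in it is illustrative: the class and module names, the use of GRU cells as stand-ins for the paper's recurrent blocks, and the loop counts are my assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Illustrative nested-loop forward pass, not the paper's actual code."""

    def __init__(self, dim: int, n_cycles: int = 4, t_steps: int = 8):
        super().__init__()
        self.n_cycles = n_cycles            # high-level (H) updates per forward pass
        self.t_steps = t_steps              # low-level (L) steps per H update
        self.l_cell = nn.GRUCell(dim, dim)  # stand-in for the fast low-level RNN
        self.h_cell = nn.GRUCell(dim, dim)  # stand-in for the slow high-level RNN
        self.readout = nn.Linear(dim, dim)  # maps the final H state to an answer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_l = torch.zeros_like(x)  # low-level ("intern") state
        z_h = torch.zeros_like(x)  # high-level ("manager") state
        for _ in range(self.n_cycles):
            # The intern iterates many times under fixed guidance from the manager.
            for _ in range(self.t_steps):
                z_l = self.l_cell(x + z_h, z_l)
            # The manager updates its plan once, using the intern's latest result.
            z_h = self.h_cell(z_l, z_h)
        return self.readout(z_h)
```

With these toy settings, a single forward pass applies 4 × 8 = 32 recurrent updates, giving far more effective depth than a fixed stack of layers of the same size.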

The Power of "Latent Reasoning"

This internal back-and-forth between the manager (H) and the intern (L) is a form of latent reasoning. Unlike "Chain-of-Thought," where an LLM writes out its thinking process step-by-step in plain English, latent reasoning happens entirely inside the model's "brain", i.e. its hidden neural states. It is "silent thinking."

  • It's faster and more efficient. Generating text is slow. By reasoning in its own internal language of vectors, the model can perform many more steps of computation in the same amount of time.
  • It's potentially more robust. It avoids the brittleness of text-based reasoning, where a single poorly worded step can derail the entire process.

Think of this latent reasoning process like deblurring a photograph. Initially, the model's internal "thought" is a blurry image: it has a rough idea but no clear details. Each cycle of reasoning between the Low-Level and High-Level modules is like applying a sharpening filter. The Low-Level module does the fine-grained pixel work, and the High-Level module looks at the result and guides the next sharpening step. This continues until the image is sharp and clear; that fully sharpened image is the "fixed point", the final converged solution.

The goal of this latent reasoning is to find a stable internal state, a "fixed point", that represents the solution. The final answer is then generated from this converged state. If you think about it, this is how we reason in everyday life too: language is just a communication medium, and the actual thinking happens in the brain, or, for a model, in the hidden layers.
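Here is a tiny, self-contained toy that captures the "iterate until the state stops changing" idea. The update rule below is made up purely for illustration; in HRM the update is computed by the H and L modules themselves.

```python
import numpy as np

def find_fixed_point(F, z0, tol=1e-6, max_iters=1000):
    """Iterate z <- F(z) until the state stops changing (a fixed point)."""
    z = z0
    for step in range(1, max_iters + 1):
        z_next = F(z)
        if np.linalg.norm(z_next - z) < tol:  # the "image" is sharp: converged
            return z_next, step
        z = z_next
    return z, max_iters

# Made-up update rule: each step nudges the "blurry" state halfway toward a target.
target = np.array([1.0, -2.0, 0.5])
F = lambda z: 0.5 * z + 0.5 * target
z_star, steps = find_fixed_point(F, np.zeros(3))
print(z_star, steps)  # converges to the target in roughly 20 steps
```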

You can skip a lot of the math that follows, but I would prefer you at least give it a try.

How Does It Learn? The Math Behind Latent Reasoning

So, the model "thinks" by letting its internal state bounce back and forth between the H and L modules until it stabilises. This stable state is called a fixed point. But how do you train a model like this? You can't use standard backpropagation through thousands of recurrent steps; the memory and compute costs explode.

This is where the paper uses a beautiful piece of mathematics: the Implicit Function Theorem (IFT).

The Fixed-Point Equation

First, let's represent the update process. The high-level state at step k, denoted z_H^k, is a function F of the previous state z_H^{k-1}, the input x̃, and the model's weights θ.

z_H^k = F(z_H^{k-1}; x̃, θ)

The model has finished "thinking" when the state stops changing—that is, it reaches a fixed point z_H⋆ where:

z_H⋆ = F(z_H⋆; x̃, θ)

The final answer depends on this stable, "converged" thought z_H⋆. To train the model, we need to know how to adjust the weights θ to make this final thought better. We need the gradient ∂z_H⋆ / ∂θ.

The IFT Shortcut

The Implicit Function Theorem provides a way to find this gradient without unrolling the entire computation. It gives us this exact formula:

∂z_H⋆ / ∂θ = (I - J_F)^{-1} ∂F / ∂θ

Let's decode this:

  • ∂F / ∂θ: This is the easy part. It's how the update rule directly changes when we tweak the weights.
  • J_F = ∂F / ∂z_H: This is the Jacobian. It's a matrix that tells us how sensitive the update rule is to a small change in the state itself. It captures the internal feedback loops of the system.
  • (I - J_F)^{-1}: This is the tricky part. It's a "correction factor" that accounts for all the ripple effects of the feedback. Changing the weights changes the state, which changes the next state, and so on. This term wraps all of that up.

The "1-Step Gradient" Approximation

Calculating that matrix inverse (I - J_F)^{-1} is still very expensive. So, the authors use a common and effective approximation. The inverse can be expressed as an infinite series (the Neumann series):

(I - J_F)^{-1} = I + J_F + J_F^2 + J_F^3 + …

The 1-step gradient trick is to approximate this whole series with just its first term: I.

(I - J_F)^{-1} ≈ I

This simplifies the gradient calculation enormously. It's like saying, "Let's ignore the complex feedback ripples and just consider the most direct effect of changing the weights." It's much faster, uses less memory, and, in practice, works very well for training these kinds of equilibrium models.
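In code, the 1-step gradient is usually implemented by running the recurrent updates to (approximate) convergence without tracking gradients, then taking one final update with gradients enabled. The sketch below shows that pattern in PyTorch; it is a generic equilibrium-model recipe and an assumption about the implementation, not the authors' code.

```python
import torch

def one_step_gradient_forward(F, z0, x, n_iters: int = 32):
    """Approximate the fixed point, then backprop through only the last step.

    F is the update rule z_H^k = F(z_H^{k-1}, x), a differentiable module.
    """
    z = z0
    # 1) "Thinking": iterate toward the fixed point with gradient tracking off,
    #    so memory does not grow with the number of iterations.
    with torch.no_grad():
        for _ in range(n_iters):
            z = F(z, x)
    # 2) 1-step gradient: one extra update with gradients on. This is the
    #    (I - J_F)^{-1} ≈ I approximation: only the direct effect of the
    #    weights on the final state is kept.
    z = F(z, x)
    return z  # compute the loss on this state; backward() sees only step (2)
```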

Thinking Fast and Slow: The Adaptive Halting Mechanism

Not all problems require the same amount of thought. HRM implements an adaptive halting mechanism that allows it to "think" for longer on harder problems and stop early on easier ones. It learns this skill using Q-learning, a classic reinforcement learning algorithm.

Here’s how it works (a small code sketch follows the list):

  1. The Q-Head: After each high-level update cycle (a "segment" of thought), a small network called the Q-head looks at the current high-level state z_H and predicts the value of two actions: Q_halt and Q_continue.
  2. The Decision: The model halts if two conditions are met:
    • It has completed a minimum number of thinking steps (a randomized threshold, M_min, to encourage exploration).
    • The predicted value of stopping is greater than the value of continuing (Q_halt > Q_continue).
  3. The Reward: The model learns the Q-values by receiving a reward. If it halts and gets the answer right, it gets a reward of +1. If it continues, it gets a reward of 0. Over time, the Q-head learns to accurately predict when it's best to stop.
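Here is a toy version of that halting rule, just to make the moving parts concrete. The paper does describe the Q-head as a small network reading the high-level state, but the specific layer, shapes, and single-sample handling below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HaltingHead(nn.Module):
    """Toy Q-head: predicts the value of halting vs. continuing from z_H."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, 2)  # outputs [Q_halt, Q_continue]

    def should_halt(self, z_h: torch.Tensor, step: int, m_min: int) -> bool:
        # z_h: current high-level state for one puzzle, shape (dim,)
        q_halt, q_continue = self.q(z_h)
        # Halt only after the minimum number of segments AND when stopping
        # looks more valuable than another segment of thinking.
        return step >= m_min and q_halt.item() > q_continue.item()
```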

Remarkably, this Q-learning process is stable in HRM without needing the usual stabilization tricks (like replay buffers). This is because HRM's design choices, RMSNorm inside the network and the AdamW optimizer during training, inherently keep the weights bounded, preventing the instability that often plagues RL.

An Emergent Hierarchy: The AI's "Brain" Organizes Itself

One of the most profound findings is that the HRM doesn't just mimic the brain's structure; it learns to mimic its organizational principles.

In neuroscience, there's a known dimensionality hierarchy. Higher-order brain regions responsible for abstract reasoning have more complex, flexible, and thus higher-dimensional neural representations. Lower-level sensory regions have simpler, lower-dimensional representations.

The paper shows that the HRM develops the exact same property through training:

  • Trained HRM: The high-level module (H) develops neural representations with a much higher effective dimensionality than the low-level module (L). It learns to use a richer, more flexible "mental workspace" for planning.
  • Untrained HRM: In a model with random weights, both modules have the same low dimensionality.

This tells us something crucial: this functional hierarchy is an emergent property of learning to solve complex reasoning tasks. The model isn't just given a structure; it learns how to use that structure in a way that mirrors the functional organization of a biological brain.
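A common way to quantify "effective dimensionality" in this kind of analysis is the participation ratio of the eigenvalues of the state covariance: it is high when variance is spread across many directions and low when a few directions dominate. The sketch below shows that measurement; the paper's exact analysis pipeline may differ in its details.

```python
import numpy as np

def participation_ratio(states: np.ndarray) -> float:
    """Effective dimensionality of hidden states, shape (num_samples, dim)."""
    centered = states - states.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    return float(eigvals.sum() ** 2 / ((eigvals ** 2).sum() + 1e-12))

# The paper's finding, in these terms: after training, the participation ratio
# of collected z_H states is much higher than that of collected z_L states,
# while in an untrained model the two are similarly low.
```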

Conclusion

The Hierarchical Reasoning Model offers a compelling path forward for AI. By moving beyond the shallow, fixed-depth paradigm of standard Transformers, it embraces the power of recurrence, hierarchy, and adaptive computation. It shows that by building models with more brain-like principles, we can create systems that don't just regurgitate information, but can learn to reason, plan, and solve problems with a depth that was previously out of reach.

You teach an LLM what to answer; you teach an HRM how to get to the answer.
