Reinforcement Learning Without Temporal Difference Learning
08 March 2026 · 5 min


Introduction to a New Approach in Reinforcement Learning

In this article, I will introduce a reinforcement learning (RL) algorithm based on an alternative paradigm: divide and conquer. Unlike traditional methods, this algorithm does not rely on temporal difference (TD) learning, whose error accumulation limits scalability, and it is therefore well suited to long-horizon tasks.

Context: Off-Policy Reinforcement Learning

Before delving into the details, let’s clarify what off-policy reinforcement learning entails. There are two main categories of algorithms in RL: on-policy and off-policy. On-policy learning requires the use of only recent data collected by the current policy, which means that older data must be discarded with each policy update. Algorithms like PPO and GRPO fall into this category. Off-policy learning, on the other hand, allows the use of any type of data, including past experiences, human demonstrations, and data from the internet. This makes off-policy learning more general and flexible, although it is also more complex. Q-learning is a well-known algorithm in this area.

In fields where data collection is expensive, such as robotics or dialogue systems, utilizing off-policy learning often becomes essential, making it a critical problem to solve. As of 2025, we have developed satisfactory solutions for scaling on-policy learning (like PPO and its variants), but we have yet to find a scalable off-policy algorithm suitable for complex, long-horizon tasks.

Paradigms in Value Learning: Temporal Difference and Monte Carlo

In off-policy RL, we typically train a value function using temporal difference learning, applying the Bellman update rule. The major challenge here is that the error in the next value propagates to the current value through bootstrapping, leading to error accumulation over the entire horizon. This dynamic makes TD learning difficult to apply to long-horizon tasks.
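To make the bootstrapping problem concrete, here is a minimal tabular sketch of a TD(0) value update. The function name, learning rate, and toy two-state chain are illustrative choices, not taken from the post:

```python
import numpy as np

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update: the target bootstraps on V[s_next].

    Any error in V[s_next] leaks into V[s] through the target; chained
    over a long horizon, these errors accumulate, which is the
    difficulty described above.
    """
    target = r + gamma * V[s_next]      # bootstrapped Bellman target
    V[s] += alpha * (target - V[s])     # move V[s] toward the target
    return V

# Toy 2-state chain: state 0 -> state 1 (reward 1), state 1 terminal.
V = np.zeros(2)
for _ in range(200):
    V = td_update(V, s=0, r=1.0, s_next=1)
print(round(V[0], 3))  # converges toward r + gamma * V[1] = 1.0
```

In a long chain of states, every intermediate value would be estimated this way, so the number of bootstrapping steps (and the error compounding) grows linearly with the horizon.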

To mitigate this issue, researchers have mixed TD learning with Monte Carlo returns. For instance, n-step TD learning uses actual Monte Carlo returns for the first n steps, then applies the bootstrapped value for the rest. While this reduces the number of Bellman recursions, it does not fundamentally resolve the error accumulation issue. Additionally, a large n can result in high variance and suboptimality.
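The n-step target described above can be sketched as follows; the function name and toy numbers are hypothetical, chosen only to show where the single bootstrap happens:

```python
def n_step_return(rewards, V, s_n, n, gamma=0.99):
    """n-step TD target: actual (Monte Carlo) rewards for the first n
    steps, then one bootstrapped value V[s_n] for the remainder."""
    G = 0.0
    for k, r in enumerate(rewards[:n]):
        G += (gamma ** k) * r           # real rewards, no bootstrapping
    G += (gamma ** n) * V[s_n]          # bootstrap once, after n steps
    return G

# Three real rewards, then bootstrap on a state valued at 4.0:
G = n_step_return([1, 0, 1], {"s3": 4.0}, "s3", n=3, gamma=0.5)
print(G)  # → 1 + 0.5*0 + 0.25*1 + 0.125*4 = 1.75
```

Note the trade-off the text mentions: a larger n replaces more bootstrapping with more sampled rewards, cutting bias but inflating variance.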

A Third Approach: Divide and Conquer

I argue that a third approach, divide and conquer, may provide an ideal solution for off-policy reinforcement learning, scaling to arbitrarily long-horizon tasks. The key idea is to split a trajectory into two equal-length segments and combine their values to update the value of the full trajectory. Because each split halves the horizon, the number of Bellman recursions shrinks from linear to logarithmic in the horizon length. This also avoids choosing a hyperparameter like n, and it does not suffer from the high variance or suboptimality associated with n-step TD learning.
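The logarithmic-depth claim can be illustrated with a toy recursion. This is a sketch of the structural idea only (here the "combination" is a simple sum of segment costs); the function name and instrumentation are my own:

```python
def dc_value(costs, i, j, depth=0, stats=None):
    """Value of trajectory segment [i, j) via divide and conquer:
    V(segment) = combine(V(first half), V(second half)).

    Each level halves the horizon, so values are combined through
    O(log2(horizon)) levels instead of O(horizon) one-step TD backups.
    """
    if stats is not None:
        stats["max_depth"] = max(stats["max_depth"], depth)
    if j - i == 1:
        return costs[i]                 # base case: a single transition
    mid = (i + j) // 2                  # split into two equal halves
    return (dc_value(costs, i, mid, depth + 1, stats)
            + dc_value(costs, mid, j, depth + 1, stats))

stats = {"max_depth": 0}
total = dc_value([1.0] * 1024, 0, 1024, stats=stats)
print(total, stats["max_depth"])  # 1024 transitions, only 10 levels deep
```

A 1024-step trajectory is resolved in 10 levels of recursion, which is why error compounds over far fewer updates than in TD learning.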

A Practical Algorithm

Recently, Aditya and I made significant progress in realizing this idea. We successfully scaled up divide-and-conquer value learning to complex tasks, particularly in the realm of goal-conditioned RL. This type of learning aims to produce a policy capable of reaching any state from any other state, which provides a natural structure for applying the method.

In a deterministic environment, the shortest-path distance between two states s and g must satisfy the triangle inequality, which can be turned into a Bellman update rule: we can update V(s, g) from two shorter-horizon values, V(s, w) and V(w, g), minimizing over the intermediate waypoint w. This is exactly the divide-and-conquer update rule we were seeking.

Conclusion

While this method shows exciting potential, a significant challenge remains: how to choose the optimal subgoal w. This question is still open and requires further research to be fully addressed. However, the divide-and-conquer paradigm could indeed transform reinforcement learning by providing a robust solution to scalability and complexity issues.

For any questions or to discuss these concepts further, feel free to contact me.

#reinforcement learning #artificial intelligence #divide and conquer
