TD-M(PC)\(^2\): Improving Temporal Difference MPC Through Policy Constraint

1Carnegie Mellon University 2University of California, Berkeley

We present Temporal Difference Learning for Model Predictive Control with Policy Constraint (TD-M(PC)\(^2\)), a simple yet effective approach built on TD-MPC2 that allows a planning-based MBRL algorithm to better exploit completely off-policy data. Without additional computational overhead or environment-specific hyperparameter tuning, it seamlessly inherits the desirable features of the \(\textit{state-of-the-art}\) pipeline and consistently improves its performance on continuous control problems. On complex 61-DoF locomotion tasks in HumanoidBench, TD-M(PC)\(^2\) achieves over 100% improvement in final average performance over the baseline.

Demo videos: Hopper-stand, Humanoid-run, h1hand-balance_simple-v0, h1hand-crawl-v0, h1hand-pole-v0, h1hand-slide-v0

Abstract

Model-based reinforcement learning algorithms that combine model-based planning with a learned value/policy prior have gained significant recognition for their high data efficiency and superior performance in continuous control. However, we discover that existing methods that rely on standard SAC-style policy iteration for value learning, directly using data generated by the planner, often suffer from \( \textit{persistent value overestimation}\). Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data-generation policy, which is always bootstrapped by the planner, and the learned policy prior. To mitigate this mismatch in a minimalist way, we propose a policy regularization term that reduces out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimal changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins.

Observations and Analysis

Temporal difference MPC combines model-based planning and value learning. It relies on the H-step lookahead policy \(\pi_H\) for high-quality data collection; the planning procedure can be summarized as:

\( \begin{split} \mu^*, \sigma^* &= \arg\max_{\mu, \sigma} \mathbb{E}_{(a_t, a_{t+1}, \ldots, a_{t+H-1}) \sim \mathcal{N}(\mu, \sigma^2)}[G(s_t)] \\ G(s_t) &= \sum_{h=0}^{H-1} \gamma^h r(z_{t+h}, a_{t+h}) + \gamma^H \hat{V}(z_{t+H}) \\ \mathrm{s.t.} \quad z_{t+h+1} &= d(z_{t+h}, a_{t+h}) \end{split} \)

where the policy prior \(\pi\) and terminal value \(\hat{V}\) are acquired through standard SAC-style approximate policy iteration.
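For concreteness, the sketch below shows what such an H-step lookahead planner can look like with a simple cross-entropy-method loop over a learned latent model. It is a minimal illustration only, assuming hypothetical callables `d` (latent dynamics), `r` (reward), and `V_hat` (terminal value), and is not the official TD-MPC2 planner (which uses MPPI-style updates with additional policy-prior samples).

```python
import torch

@torch.no_grad()
def plan(z0, d, r, V_hat, act_dim, H=3, n_samples=512, n_iters=6,
         n_elites=64, gamma=0.99):
    """Minimal CEM-style H-step lookahead sketch (illustrative, not the
    official TD-MPC2 planner). z0 is a 1-D latent state; d, r, V_hat are
    assumed callables for the learned dynamics, reward, and value models."""
    mu = torch.zeros(H, act_dim)
    sigma = torch.ones(H, act_dim)
    for _ in range(n_iters):
        # Sample H-step action sequences from N(mu, sigma^2).
        actions = mu + sigma * torch.randn(n_samples, H, act_dim)
        z = z0.expand(n_samples, -1)
        G = torch.zeros(n_samples)
        discount = 1.0
        for h in range(H):
            a = actions[:, h]
            G = G + discount * r(z, a)   # accumulate discounted reward
            z = d(z, a)                  # roll the latent dynamics forward
            discount *= gamma
        G = G + discount * V_hat(z)      # bootstrap with the terminal value
        # Refit the sampling distribution to the elite action sequences.
        elites = actions[G.topk(n_elites).indices]
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6
    return mu[0]  # receding horizon: execute only the first action
```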


\(\textbf{Policy Mismatch}\): \(\pi_H\) for exploration and \(\pi\) for exploitation, leading to persistent value overestimation.

Using completely off-policy data for TD learning leads to out-of-distribution queries and incurs substantial approximation error. Even though the H-step lookahead policy \(\pi_H\) is theoretically less sensitive to value approximation errors, substantial errors are introduced and \(\textbf{accumulated}\) through policy iteration, leading to further divergence. As a result, naively applying policy iteration fails to fully exploit the planner's potential.

Following common practice in offline RL, we introduce constrained policy iteration by adding a simple regularization term that pushes the policy toward the behavior policy \(\mu\) in the buffer. This minimalist modification requires fewer than 10 lines of code to implement on top of \(\textit{TD-MPC2}\).

\( \mathcal{L}_{\pi} = -\underset{a \sim \pi}{\mathbb{E}} \left[Q(s, a)-\alpha \log\pi(a | s)+\beta\log\mu(a | s)\right] \)
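A minimal PyTorch-style sketch of this constrained policy objective is shown below, assuming per-dimension Gaussian action distributions for the policy `pi_dist` and the logged behavior distribution `mu_dist`, and a critic callable `Q(s, a)`; all names and interfaces are illustrative assumptions rather than the exact TD-MPC2 code.

```python
import torch

def policy_loss(pi_dist, mu_dist, Q, s, alpha=0.2, beta=0.1):
    """Constrained policy objective sketch: maximize Q with an entropy bonus
    while staying close to the behavior policy mu stored in the replay buffer.
    alpha/beta values are placeholders for illustration."""
    a = pi_dist.rsample()                 # reparameterized action sample
    log_pi = pi_dist.log_prob(a).sum(-1)  # log pi(a|s)
    log_mu = mu_dist.log_prob(a).sum(-1)  # log mu(a|s), the behavior prior
    # L_pi = -E_{a~pi}[ Q(s,a) - alpha*log pi(a|s) + beta*log mu(a|s) ]
    return -(Q(s, a) - alpha * log_pi + beta * log_mu).mean()
```

The added \(\beta\log\mu(a|s)\) term acts as a behavior-cloning-style regularizer, discouraging the policy prior from proposing actions far outside the data distribution on which the value function was trained.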


\(\textbf{Overestimation Bias}\): TD-MPC tends to exhibit large approximation errors, especially in complex tasks; with the policy-constraint regularization added, value overestimation is effectively reduced.

Takeaways:

  • We observed persistent value overestimation that leads to poor performance on complex tasks.
  • Fundamental limitation: value approximation error caused by a non-converging policy mismatch.
  • A simple yet effective policy regularization term can mitigate this by reducing out-of-distribution queries.

Experiments and Qualitative Results


Performance on DMControl. We report mean performance and 95% CIs over 3 random seeds across 7 high-dimensional continuous control tasks. We also report the average performance across tasks for each algorithm.

Qualitative comparison videos (Ours vs. TD-MPC2): h1hand-run-v0, h1hand-slide-v0

BibTeX


      @article{lin2025td,
        title={TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint},
        author={Lin, Haotian and Wang, Pengcheng and Schneider, Jeff and Shi, Guanya},
        journal={arXiv preprint arXiv:2502.03550},
        year={2025}
      }