TD-M(PC)\(^2\): Improving Temporal Difference MPC Through Policy Constraint

1Carnegie Mellon University 2University of California, Berkeley

We present Temporal Difference Learning for Model Predictive Control with Policy Constraint (TD-M(PC)\(^2\)), a simple yet effective approach built on TD-MPC2 that allows a planning-based MBRL algorithm to better exploit completely off-policy data. Without additional computational overhead or environment-specific hyperparameter tuning, it seamlessly inherits the desirable features of the \(\textit{state-of-the-art}\) pipeline and consistently improves its performance on continuous control problems. On complex 61-DoF locomotion tasks in HumanoidBench, TD-M(PC)\(^2\) achieves over 100% improvement in final average performance over the baseline.

Demo videos: Hopper-stand, Humanoid-run, h1hand-balance_simple-v0, h1hand-crawl-v0, h1hand-pole-v0, h1hand-slide-v0

Abstract

Model-based reinforcement learning algorithms that combine model-based planning with a learned value/policy prior have gained significant recognition for their high data efficiency and superior performance in continuous control. However, we discover that existing methods that rely on standard SAC-style policy iteration for value learning, directly using data generated by the planner, often suffer from \( \textit{persistent value overestimation}\). Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data-generation policy, which is always bootstrapped by the planner, and the learned policy prior. To mitigate this mismatch in a minimalist way, we propose a policy regularization term that reduces out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimal changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins.

Observations and Analysis

Temporal difference MPC combines model-based planning and value learning. It relies on the H-step lookahead policy \(\pi_H\) for high-quality data collection; the planning procedure can be summarized as:

\( \begin{split} \mu^*, \sigma^* &= \arg\max_{\mu, \sigma} \mathbb{E}_{(a_t, a_{t+1}, \ldots, a_{t+H-1}) \sim \mathcal{N}(\mu, \sigma^2)}[G(s_t)] \\ G(s_t) &= \sum_{h=0}^{H-1} \gamma^h r(z_{t+h}, a_{t+h}) + \gamma^H \hat{V}(z_{t+H}) \\ \mathrm{s.t.} \quad z_{t+h+1} &= d(z_{t+h}, a_{t+h}) \end{split} \)

where the policy prior \(\pi\) and terminal value \(\hat{V}\) are acquired through standard SAC-style approximate policy iteration.
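For concreteness, the sketch below shows what such an H-step lookahead planner can look like with a simple cross-entropy-method loop over a learned latent model. It is a minimal illustration only, assuming hypothetical callables `d` (latent dynamics), `r` (reward), and `V_hat` (terminal value), and is not the official TD-MPC2 planner (which uses MPPI-style updates with additional policy-prior samples).

```python
import torch

@torch.no_grad()
def plan(z0, d, r, V_hat, act_dim, H=3, n_samples=512, n_iters=6,
         n_elites=64, gamma=0.99):
    """Minimal CEM-style H-step lookahead sketch (illustrative, not the
    official TD-MPC2 planner). z0 is a 1-D latent state; d, r, V_hat are
    assumed callables for the learned dynamics, reward, and value models."""
    mu = torch.zeros(H, act_dim)
    sigma = torch.ones(H, act_dim)
    for _ in range(n_iters):
        # Sample H-step action sequences from N(mu, sigma^2).
        actions = mu + sigma * torch.randn(n_samples, H, act_dim)
        z = z0.expand(n_samples, -1)
        G = torch.zeros(n_samples)
        discount = 1.0
        for h in range(H):
            a = actions[:, h]
            G = G + discount * r(z, a)   # accumulate discounted reward
            z = d(z, a)                  # roll the latent dynamics forward
            discount *= gamma
        G = G + discount * V_hat(z)      # bootstrap with the terminal value
        # Refit the sampling distribution to the elite action sequences.
        elites = actions[G.topk(n_elites).indices]
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6
    return mu[0]  # receding horizon: execute only the first action
```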


\(\textbf{Policy Mismatch}\): \(\pi_H\) for exploration and \(\pi\) for exploitation, leading to persistent value overestimation.

Using completely off-policy data for TD learning leads to out-of-distribution queries and incurs substantial approximation error. Even though the H-step lookahead policy \(\pi_H\) is theoretically less sensitive to value approximation errors, substantial errors are introduced and \(\textbf{accumulated}\) through policy iteration, leading to further divergence. As a result, naively applying policy iteration fails to fully exploit the planner's potential.

Following common practice in offline RL, we introduce constrained policy iteration by adding a simple regularization term that pushes the policy toward the behavior policy \(\mu\) in the buffer. This minimalist modification requires fewer than 10 lines of code to implement on top of \(\textit{TD-MPC2}\).

\( \mathcal{L}_{\pi} = -\underset{a \sim \pi}{\mathbb{E}} \left[Q(s, a)-\alpha \log\pi(a | s)+\beta\log\mu(a | s)\right] \)
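A minimal PyTorch-style sketch of this constrained policy objective is shown below, assuming per-dimension Gaussian action distributions for the policy `pi_dist` and the logged behavior distribution `mu_dist`, and a critic callable `Q(s, a)`; all names and interfaces are illustrative assumptions rather than the exact TD-MPC2 code.

```python
import torch

def policy_loss(pi_dist, mu_dist, Q, s, alpha=0.2, beta=0.1):
    """Constrained policy objective sketch: maximize Q with an entropy bonus
    while staying close to the behavior policy mu stored in the replay buffer.
    alpha/beta values are placeholders for illustration."""
    a = pi_dist.rsample()                 # reparameterized action sample
    log_pi = pi_dist.log_prob(a).sum(-1)  # log pi(a|s)
    log_mu = mu_dist.log_prob(a).sum(-1)  # log mu(a|s), the behavior prior
    # L_pi = -E_{a~pi}[ Q(s,a) - alpha*log pi(a|s) + beta*log mu(a|s) ]
    return -(Q(s, a) - alpha * log_pi + beta * log_mu).mean()
```

The added \(\beta\log\mu(a|s)\) term acts as a behavior-cloning-style regularizer, discouraging the policy prior from proposing actions far outside the data distribution on which the value function was trained.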


\(\textbf{Overestimation Bias}\): TD-MPC tends to exhibit large approximation errors, especially in complex tasks; with the policy-constraint regularization added, value overestimation is effectively reduced.

Takeaways:

  • We observed persistent value overestimation that leads to poor performance on complex tasks.
  • Fundamental limitation: value approximation error caused by a non-converging policy mismatch.
  • A simple yet effective policy regularization term can mitigate this by reducing out-of-distribution queries.

Experiments and Qualitative Results


Performance on DMControl. We report mean performance and 95% CIs over 3 random seeds across 7 high-dimensional continuous control tasks. We also report the average performance across tasks for each algorithm.

Qualitative comparison videos (Ours vs. TD-MPC2): h1hand-run-v0, h1hand-slide-v0

BibTeX


      @article{lin2025td,
        title={TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint},
        author={Lin, Haotian and Wang, Pengcheng and Schneider, Jeff and Shi, Guanya},
        journal={arXiv preprint arXiv:2502.03550},
        year={2025}
      }