[Paper Notes] TATU

Uncertainty-driven Trajectory Truncation for Model-based Offline Reinforcement Learning

Problem

Samples generated by the learned dynamics model are not completely reliable.

TATU adaptively truncates a synthetic trajectory once the uncertainty accumulated along it becomes too large.
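
As a rough illustration of this truncation rule, here is a minimal Python sketch. The function name `truncate_by_accumulated_uncertainty`, the list of per-step uncertainties, and the choice to drop the step that pushes the running sum over the threshold are assumptions made for illustration, not the paper's code.

```python
from typing import Sequence


def truncate_by_accumulated_uncertainty(
    uncertainties: Sequence[float], threshold: float
) -> int:
    """Return how many leading steps of a synthetic rollout to keep.

    `uncertainties[t]` is the (hypothetical) uncertainty of the t-th
    imagined transition. The rollout is cut as soon as the running sum
    exceeds `threshold`; the offending step is discarded as well.
    """
    accumulated = 0.0
    for step, u in enumerate(uncertainties):
        accumulated += u
        if accumulated > threshold:
            return step  # keep steps [0, step), drop the rest
    return len(uncertainties)  # never exceeded: keep the full rollout


# Usage: the running sum is 0.1, 0.3, 0.8, so the third step is cut.
kept = truncate_by_accumulated_uncertainty([0.1, 0.2, 0.5, 0.1], threshold=0.4)
print(kept)
```

Only the first `kept` transitions of the rollout would then be added to the synthetic buffer.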

The paper derives a performance bound for the $\varepsilon$-Pessimistic MDP with respect to the real MDP.
- The bound characterizes the sub-optimality of a policy trained in the $\varepsilon$-Pessimistic MDP.
- When the dataset is large enough, the sub-optimality admits a further simplified upper bound.

$\underline{line3}$: Train an ensemble of N dynamics models via maximum log-likelihood.
$\underline{line4}$: [Construct $\varepsilon$-Pessimistic MDP] Set the uncertainty threshold based on the maximum transition uncertainty in the dataset.
$\underline{line5}$: (OPTIONAL) If TATU is combined with model-free methods, an additional rollout policy is required; it is modeled with a CVAE so that the generated actions lie in the span of the dataset.
$\underline{line12}$: [Construct $\varepsilon$-Pessimistic MDP] Reward is additionally penalized in terms of uncertainty.
$\underline{line13}$: [Construct $\varepsilon$-Pessimistic MDP] Use a MOPO-style uncertainty estimator.
$\underline{line14 \sim 18}$: Generate conservative trajectories by truncating any trajectory whose accumulated uncertainty exceeds the threshold.
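
The threshold, uncertainty estimator, and reward penalty in the steps above could be sketched as follows. This is a hedged sketch, not the authors' implementation. It assumes each ensemble member predicts a diagonal Gaussian, takes the uncertainty to be the largest L2 norm of the predicted standard deviations across members, divides the dataset maximum by the threshold coefficient $\alpha$, and applies a linear penalty with a hypothetical coefficient `penalty_coef`.

```python
import math
from typing import Sequence


def mopo_style_uncertainty(ensemble_std: Sequence[Sequence[float]]) -> float:
    """Uncertainty of one (state, action) pair.

    `ensemble_std[i]` is the standard-deviation vector predicted by the
    i-th dynamics model (assumed diagonal Gaussian). The estimate is the
    maximum L2 norm over ensemble members.
    """
    return max(math.sqrt(sum(s * s for s in std)) for std in ensemble_std)


def uncertainty_threshold(dataset_uncertainties: Sequence[float], alpha: float) -> float:
    """Threshold from the largest uncertainty seen on dataset transitions.

    Scaling by 1 / alpha is an assumption about how the threshold
    coefficient enters.
    """
    return max(dataset_uncertainties) / alpha


def penalized_reward(reward: float, uncertainty: float, penalty_coef: float) -> float:
    """Reward with a linear uncertainty penalty (assumed form)."""
    return reward - penalty_coef * uncertainty


# Usage with two ensemble members and a 2-dimensional state.
u = mopo_style_uncertainty([[3.0, 4.0], [1.0, 0.0]])
eps = uncertainty_threshold([1.0, u, 2.0], alpha=2.0)
r = penalized_reward(1.0, u, penalty_coef=0.1)
```

Under these assumptions, a larger $\alpha$ gives a smaller threshold and therefore shorter, more conservative rollouts.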

Benchmark: D4RL (MuJoCo)
Backbones: MOPO, COMBO, CQL, TD3_BC, BCQ
Baselines: the backbones themselves, plus MOReL, BC, IQL, DT

Combined with model-based offline RL (TATU+MOPO, TATU+COMBO, baselines) $\to$ TATU markedly boosts the performance of the base methods on most of the datasets.
Combined with model-free offline RL (TATU+CQL, TATU+BCQ, TATU+TD3_BC, baselines) $\to$ TATU significantly improves the performance of the base methods on many datasets, especially on poor-quality datasets.
Parameter study $\to$ Rollout horizon $h$, Threshold coefficient $\alpha$, Real data ratio $\eta$.
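
To make the role of the real data ratio $\eta$ concrete, here is a minimal sketch of mixing real and synthetic samples in one training batch. It assumes $\eta$ is the fraction of each batch drawn from the real dataset, which is a common convention in model-based offline RL; the function name `sample_mixed_batch` and its arguments are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_mixed_batch(
    real_data: Sequence[T],
    synthetic_data: Sequence[T],
    batch_size: int,
    real_ratio: float,
    rng: random.Random,
) -> List[T]:
    """Draw a batch with roughly `real_ratio` of its samples from real data.

    Sampling is with replacement. Both buffers must be non-empty whenever
    their share of the batch is non-zero.
    """
    n_real = int(round(batch_size * real_ratio))
    n_synthetic = batch_size - n_real
    batch = [rng.choice(real_data) for _ in range(n_real)]
    batch += [rng.choice(synthetic_data) for _ in range(n_synthetic)]
    return batch


# Usage: with real_ratio = 0.25 and batch_size = 8, two samples are real.
batch = sample_mixed_batch(["r"] * 10, ["s"] * 10, 8, 0.25, random.Random(0))
```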

Academic

#RL #Offline #Algorithm

https://jasonzhujp.github.io/2023/04/13/paper-rl-09/

Author

Jason Zhu

Published on

April 13, 2023
