[Paper Notes] CABI

Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination (NeurIPS 2022)

Problem

In offline model-based RL, the transitions imagined by the learned dynamics model may be inaccurate, and training on them degrades performance.

Idea

Augment the offline dataset with transitions generated by trained bidirectional (forward and reverse) dynamics models and rollout policies, filtered by a double check. Conservatism is introduced by trusting only the samples on which the forward model and the backward model agree.

Algo

Bidirectional dynamics model training: Train ensembles of bootstrapped probabilistic forward and reverse dynamics models, $\hat{p}_ {\varphi}$ and $\hat{p}_ {\phi}$ respectively, by maximizing log-likelihood.
Bidirectional rollout policy training: Model each rollout policy with a CVAE that, given the current state, generates actions that stay within the span of the dataset. The forward and backward rollout policies are represented by two separate CVAEs.
Conservative data augmentation: For the forward check, generate the next state $\hat{s}_ {t+1}$ with the forward dynamics model, trace back from $\hat{s}_ {t+1}$ with the reverse dynamics model to obtain $\tilde{s}_ {t}$, and measure the deviation of $\tilde{s}_ {t}$ from $s_{t}$. Trust $\hat{s}_ {t+1}$ only if its deviation is among the smallest $k\%$ in the mini-batch. The backward check is symmetric: imagine the previous state with the reverse model, then verify it with the forward model.
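
As a rough illustration of the forward double check described above, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `double_check_forward`, the toy models `toy_forward` and `toy_reverse`, the L2 norm as the deviation measure, and reusing the same action for the trace-back step are all assumptions made for this example. In CABI the two models would be learned probabilistic ensembles rather than hand-written functions.

```python
import numpy as np


def double_check_forward(states, actions, forward_model, reverse_model, k_percent):
    """Keep the imagined next states whose round trip lands closest to s_t."""
    s_next_hat = forward_model(states, actions)    # imagined \hat{s}_{t+1}
    s_tilde = reverse_model(s_next_hat, actions)   # traced-back \tilde{s}_t
    # Assumption: deviation is measured with the L2 norm per sample.
    deviation = np.linalg.norm(s_tilde - states, axis=1)
    # Trust only the k% of the mini-batch with the smallest deviation.
    n_keep = max(1, int(len(states) * k_percent / 100))
    keep = np.argsort(deviation)[:n_keep]
    return keep, s_next_hat[keep], deviation


def toy_forward(s, a):
    # Hypothetical forward model: exact near the origin, biased far from it.
    bias = 0.5 * (np.abs(s) > 1.0)
    return s + a + bias


def toy_reverse(s_next, a):
    # Hypothetical reverse model: inverts the true dynamics s' = s + a.
    return s_next - a


rng = np.random.default_rng(0)
states = rng.uniform(-2.0, 2.0, size=(8, 3))
actions = rng.uniform(-1.0, 1.0, size=(8, 3))

keep, trusted_next, deviation = double_check_forward(
    states, actions, toy_forward, toy_reverse, k_percent=25
)
```

With `k_percent=25` and a mini-batch of 8, only the 2 transitions with the smallest round-trip deviation are kept. The backward check would follow the same pattern with the roles of the two models swapped.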

Observation

Plot the state distributions of samples generated by a random policy, the forward model, the reverse model, and CABI.

Bidirectional modeling with the double check mechanism successfully produces reliable and conservative synthetic samples.

Env & Baselines

D4RL (Adroit, MuJoCo)
Combine CABI with BCQ, TD3_BC, and IQL, then compare against UWAC, CQL, BCQ, MOPO, and COMBO.

Exp

Performance on Adroit $\to$ Offline MBRL generally fails on Adroit because of the narrow data distribution and the accumulated error produced by the actor. CABI improves performance through bidirectional dynamics and the double check.
Ablation Study $\to$ Examines the importance of the double check, and compares CABI with forward-only and backward-only imagination and with other data augmentation methods.
Performance on MuJoCo $\to$ CABI is a powerful data augmentation method that can boost the performance of model-free offline RL methods.

Academic

#RL #Offline #Algorithm

https://jasonzhujp.github.io/2023/04/27/paper-rl-14/

Author

Jason Zhu

Published on

April 27, 2023
