[Paper Notes] CABI
Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination NeurIPS 2022
Problem
The imagined transitions generated by learned dynamics models can be inaccurate in offline model-based RL, which degrades performance.
Idea
Augment the offline dataset using trained bidirectional dynamics models and rollout policies with a double check. Conservatism is introduced by trusting only the samples on which the forward and backward models agree.
Algo
- Bidirectional dynamics models training: Train an ensemble of bootstrapped probabilistic forward and reverse dynamics models $\hat{p}_ {\varphi}$ and $\hat{p}_ {\phi}$ by maximizing log-likelihood.
- Bidirectional rollout policies training: Model each rollout policy with a CVAE that generates actions staying within the span of the dataset, conditioned on the current state. The forward and backward rollout policies are represented by two separate CVAEs.
- Conservative Data Augmentation: For the forward check, generate the next state $\hat{s}_ {t+1}$ via the forward dynamics model, trace back from $\hat{s}_ {t+1}$ with the reverse dynamics model to obtain $\tilde{s}_ {t}$, and evaluate the deviation of $\tilde{s}_ {t}$ from $s_{t}$. Trust $\hat{s}_ {t+1}$ only if this deviation ranks in the smallest $k\%$ of the mini-batch. The backward check is symmetric.
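The forward double check above can be sketched as a batch filter. This is a minimal illustration, not the paper's implementation: `forward_model` / `backward_model` stand in for the learned dynamics ensembles, and their signatures are assumptions made for this sketch.

```python
import numpy as np

def double_check_filter(s_t, a_t, forward_model, backward_model, k=0.2):
    """Forward double check (sketch): keep an imagined next state only when
    tracing it back through the reverse model lands close to the true state.

    Assumed signatures (for illustration only):
      forward_model(s, a)  -> imagined next state  s_hat_{t+1}
      backward_model(s', a) -> reconstructed state s_tilde_t
    """
    s_next_hat = forward_model(s_t, a_t)          # forward imagination
    s_t_recon = backward_model(s_next_hat, a_t)   # trace back with reverse model
    # Per-sample disagreement between reconstructed and true state
    deviation = np.linalg.norm(s_t_recon - s_t, axis=-1)
    # Trust the smallest-deviation k% of the mini-batch
    n_keep = max(1, int(k * len(deviation)))
    keep = np.argsort(deviation)[:n_keep]
    return s_next_hat[keep], keep

# Toy usage: linear "models" that are exact inverses, so all deviations are 0
fwd = lambda s, a: s + a
bwd = lambda s_next, a: s_next - a
s = np.random.randn(10, 3)
a = np.random.randn(10, 3)
kept_next, idx = double_check_filter(s, a, fwd, bwd, k=0.2)  # keeps 2 of 10
```

The backward check mirrors this: roll out a previous state with the reverse model, re-predict forward, and keep the smallest-deviation samples.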
Observation
Plot the state distributions of samples generated by a random policy, the forward model, the reverse model, and CABI, respectively.
Env & Baselines
D4RL (Adroit, MuJoCo)
Combine CABI with BCQ, TD3+BC, IQL, then compare to UWAC, CQL, BCQ, MOPO, COMBO
Exp
- Performance on Adroit $\to$ Offline MBRL generally fails on Adroit due to the narrow data distribution and accumulated model error. CABI improves performance via bidirectional dynamics and the double check.
- Ablation Study $\to$ the importance of the double check; comparison with forward-only/backward-only imagination; comparison with other data augmentation methods.
- Performance on MuJoCo $\to$ CABI is a powerful data augmentation method and can boost the performance of model-free offline RL methods.
https://jasonzhujp.github.io/2023/04/27/paper-rl-14/