[Paper Notes] CABI

Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination (NeurIPS 2022)

Problem

In offline model-based RL, the imagined transitions generated by learned dynamics models may be inaccurate, which degrades performance.

Idea

Augment the offline dataset using trained bidirectional dynamics models and rollout policies, combined with a double-check mechanism. Conservatism is introduced by trusting only the samples on which the forward model and the backward model agree.

Algo

  1. Bidirectional dynamics models training: Train an ensemble of bootstrapped probabilistic forward and reverse dynamics models $\hat{p}_ {\varphi}$ and $\hat{p}_ {\phi}$ by maximizing log-likelihood.
  2. Bidirectional rollout policies training: Model each rollout policy with a conditional VAE (CVAE) that, conditioned on the current state, generates actions that stay within the span of the dataset. The forward and backward rollout policies are represented by two separate CVAEs.

  3. Conservative data augmentation: For the forward check, generate the next state $\hat{s}_ {t+1}$ with the forward dynamics model, then trace back from $\hat{s}_ {t+1}$ with the reverse dynamics model to obtain $\tilde{s}_ {t}$. Measure the deviation of $\tilde{s}_ {t}$ from $s_{t}$, and trust $\hat{s}_ {t+1}$ only if this deviation is among the smallest $k\%$ in the mini-batch. The backward check mirrors this with the roles of the two models swapped.
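
The forward double check in step 3 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `forward_double_check`, the toy models `toy_forward` and `toy_reverse`, the use of the Euclidean norm as the deviation measure, and the assumption that the reverse model reuses the same action are all choices made here for illustration.

```python
import numpy as np


def forward_double_check(states, actions, forward_model, reverse_model, k_percent):
    """Keep imagined next states whose round-trip deviation is among the smallest k%.

    forward_model(s, a) -> predicted next state (hypothetical interface)
    reverse_model(s_next, a) -> predicted previous state (hypothetical interface)
    """
    next_hat = forward_model(states, actions)        # \hat{s}_{t+1}
    state_tilde = reverse_model(next_hat, actions)   # \tilde{s}_t, traced back
    # Assumption: deviation is measured with the Euclidean norm per sample.
    deviation = np.linalg.norm(state_tilde - states, axis=1)
    # Sort in ascending order and keep the first k% of the mini-batch.
    n_keep = max(1, int(len(states) * k_percent / 100))
    keep_idx = np.argsort(deviation)[:n_keep]
    return next_hat[keep_idx], keep_idx, deviation


def toy_forward(s, a):
    # Toy stand-in for a learned forward dynamics model.
    return s + a


def toy_reverse(s_next, a):
    # Toy stand-in for a learned reverse model that disagrees with the
    # forward model whenever the first state dimension is positive.
    disagreement = 0.5 * (s_next[:, :1] > 0)
    return s_next - a + disagreement


rng = np.random.default_rng(0)
states = rng.normal(size=(8, 3))
actions = rng.normal(size=(8, 3))

trusted_next, keep_idx, deviation = forward_double_check(
    states, actions, toy_forward, toy_reverse, k_percent=50
)
```

The backward check would follow the same pattern with the two models swapped: start from $s_{t+1}$, imagine the previous state with the reverse model, roll it forward again, and compare the result with $s_{t+1}$.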


Observation

Plot the state distributions of the samples generated by a random policy, the forward model, the reverse model, and CABI, respectively.


Bidirectional modeling with the double check mechanism successfully produces reliable and conservative synthetic samples.

Env & Baselines

D4RL (Adroit, MuJoCo)
Combine CABI with BCQ, TD3_BC, and IQL, then compare against UWAC, CQL, BCQ, MOPO, and COMBO.

Exp

  • Performance on Adroit $\to$ Offline MBRL methods generally fail on Adroit because of the narrow data distribution and the accumulated error produced by the actor. CABI improves performance through bidirectional dynamics and the double check.
  • Ablation study $\to$ Examines the importance of the double check, and compares CABI with forward-only and backward-only imagination and with other data augmentation methods.
  • Performance on MuJoCo $\to$ CABI is a powerful data augmentation method and can boost the performance of model-free offline RL methods.

https://jasonzhujp.github.io/2023/04/27/paper-rl-14/
Author: Jason Zhu
Published: April 27, 2023