[Paper Notes] EDAC
Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble NeurIPS 2021
Problem
- Prior methods typically require an accurate estimate of the behavior policy or sampling from OOD data points, both of which are difficult in practice.
- Prior methods under-utilize the generalization ability of the learned Q-function and often settle on suboptimal solutions that stay too close to the given dataset (i.e., they are overly conservative).
Idea
Propose an uncertainty-based, ensemble-diversified actor-critic offline RL method with clipped Q-learning, which incorporates the confidence of the Q-value predictions and requires neither estimation of nor sampling from the data distribution.
Observation
- Clipped Q-learning can be leveraged to successfully penalize OOD data points that have high prediction uncertainty: the size of the penalty is highly correlated with the standard deviation of the Q-value estimates.
- It is useful to increase the number of Q-networks along with the clipped Q-learning for better performance.
Theory
- Clipped Q-learning can be interpreted either as penalizing state-action pairs with high-variance Q-value estimates, or as using the lower confidence bound of the Q-value predictions. Theoretically, taking the clipped (minimum) Q-value is similar to penalizing the ensemble mean of the Q-values by the standard deviation scaled by a coefficient that depends on the ensemble size $N$.
- Alignment of the input gradients across the ensemble can cause insufficient penalization of near-distribution data points, which in turn requires a large number of ensemble networks. Theoretically, the variance of the Q-values for an OOD action along the direction $w_{\min}$ is upper-bounded by a constant multiple of $\epsilon$.
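The mean-minus-scaled-std relationship above can be checked numerically. The sketch below simulates a hypothetical ensemble of Gaussian Q-value predictions and measures how far the clipped (minimum) Q-value sits below the ensemble mean, in units of the ensemble standard deviation; the Gaussian assumption and the sizes are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10  # ensemble size (number of Q-networks)

# Simulated Q-value predictions: 100k state-action pairs, N ensemble members each.
q_samples = rng.normal(loc=0.0, scale=1.0, size=(100_000, N))

clipped = q_samples.min(axis=1)            # clipped Q-learning: min over the ensemble
mean = q_samples.mean(axis=1)
std = q_samples.std(axis=1)

# Empirical coefficient c such that min(Q) ~= mean(Q) - c * std(Q).
c = np.mean((mean - clipped) / std)
print(f"min(Q) is roughly mean(Q) - {c:.2f} * std(Q) for N = {N}")
```

Re-running with a larger `N` makes the coefficient grow, matching the claim that the effective std-penalty depends on the ensemble size.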
Algo
Built on SAC, with an ensemble-similarity (gradient diversification) penalty added to the Q-function training objective.
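A minimal numpy sketch of such a diversification penalty: the mean pairwise cosine similarity between the Q-networks' action-gradients, which training drives down so the ensemble disagrees more on OOD actions. The gradients are passed in as plain arrays for illustration; in practice they would come from autograd ($\nabla_a Q_i(s,a)$), and the function name and shapes here are assumptions.

```python
import numpy as np

def ensemble_diversification_loss(grads):
    """Diversity penalty: mean pairwise cosine similarity between the
    action-gradients of the N Q-networks.

    grads: array of shape (N, action_dim); row i is grad_a Q_i(s, a).
    """
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    cos = unit @ unit.T                    # (N, N) pairwise cosine similarities
    n = grads.shape[0]
    off_diag = cos.sum() - np.trace(cos)   # drop the self-similarity terms
    return off_diag / (n * (n - 1))

# Perfectly aligned gradients -> maximal penalty; orthogonal -> no penalty.
aligned = np.array([[1.0, 0.0], [1.0, 0.0]])
orthogonal = np.array([[1.0, 0.0], [0.0, 1.0]])
print(ensemble_diversification_loss(aligned))     # 1.0
print(ensemble_diversification_loss(orthogonal))  # 0.0
```

Minimizing this term pushes the ensemble members' gradients apart, which raises the Q-value variance (and hence the clipped-Q penalty) on near-distribution OOD actions without needing a huge ensemble.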
Env & Baselines
D4RL Mujoco and Adroit
SAC, CQL, REM, BC
Exp
- Evaluation on D4RL MuJoCo Gym tasks (vs. baselines) $\to$ EDAC outperforms them:
  - The performance gap is especially large on the random, medium, and medium-replay datasets.
  - EDAC needs a much smaller Q-ensemble than SAC-N.
  - EDAC chooses from a more diverse range of actions than CQL.
- Evaluation on D4RL Adroit tasks (baselines) $\to$ Better performance.
- Computational cost comparison (vs. SAC and CQL) $\to$ EDAC runs faster than CQL with comparable memory consumption.
https://jasonzhujp.github.io/2023/04/11/paper-rl-07/