[Paper Notes] EDAC
Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble NeurIPS 2021
Problem
- Prior methods typically require an accurate estimate of the behavior policy or sampling from OOD data points, both of which are difficult in practice.
- Prior methods under-utilize the generalization ability of the learned Q-function and often settle on suboptimal solutions that stay too close to the given dataset (i.e., they are overly conservative).
Idea
Propose an uncertainty-based, ensemble-diversified actor-critic offline RL method with clipped Q-learning, which incorporates the confidence of the Q-value predictions and requires neither estimation of nor sampling from the data distribution.
Observation
- Clipped Q-learning can be leveraged to successfully penalize OOD data points that have high prediction uncertainty: the size of the penalty is highly correlated with the standard deviation of the Q-value estimates.
- It is useful to increase the number of Q-networks along with the clipped Q-learning for better performance.
Theory
- Clipped Q-learning can be interpreted either as penalizing state-action pairs with high-variance Q-value estimates, or as using the lower confidence bound of the Q-value predictions. Theoretically, taking the clipped (minimum) Q-value is similar to penalizing the ensemble mean of the Q-values by the standard deviation scaled by a coefficient that depends on the ensemble size $N$.
- Alignment of the input gradients across the ensemble can cause insufficient penalization of near-distribution data points, which in turn requires a large number of ensemble networks. Theoretically, the variance of the Q-values for an OOD action along the direction $w_{\min}$ is upper-bounded by a constant multiple of $\epsilon$.
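The mean-minus-scaled-std relationship above can be checked numerically. The sketch below simulates a hypothetical ensemble of Gaussian Q-value predictions and measures how far the clipped (minimum) Q-value sits below the ensemble mean, in units of the ensemble standard deviation; the Gaussian assumption and the sizes are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10  # ensemble size (number of Q-networks)

# Simulated Q-value predictions: 100k state-action pairs, N ensemble members each.
q_samples = rng.normal(loc=0.0, scale=1.0, size=(100_000, N))

clipped = q_samples.min(axis=1)            # clipped Q-learning: min over the ensemble
mean = q_samples.mean(axis=1)
std = q_samples.std(axis=1)

# Empirical coefficient c such that min(Q) ~= mean(Q) - c * std(Q).
c = np.mean((mean - clipped) / std)
print(f"min(Q) is roughly mean(Q) - {c:.2f} * std(Q) for N = {N}")
```

Re-running with a larger `N` makes the coefficient grow, matching the claim that the effective std-penalty depends on the ensemble size.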
Algo
Built on SAC, with an ensemble-similarity (gradient diversification) penalty added to the Q-function training objective.
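A minimal numpy sketch of such a diversification penalty: the mean pairwise cosine similarity between the Q-networks' action-gradients, which training drives down so the ensemble disagrees more on OOD actions. The gradients are passed in as plain arrays for illustration; in practice they would come from autograd ($\nabla_a Q_i(s,a)$), and the function name and shapes here are assumptions.

```python
import numpy as np

def ensemble_diversification_loss(grads):
    """Diversity penalty: mean pairwise cosine similarity between the
    action-gradients of the N Q-networks.

    grads: array of shape (N, action_dim); row i is grad_a Q_i(s, a).
    """
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    cos = unit @ unit.T                    # (N, N) pairwise cosine similarities
    n = grads.shape[0]
    off_diag = cos.sum() - np.trace(cos)   # drop the self-similarity terms
    return off_diag / (n * (n - 1))

# Perfectly aligned gradients -> maximal penalty; orthogonal -> no penalty.
aligned = np.array([[1.0, 0.0], [1.0, 0.0]])
orthogonal = np.array([[1.0, 0.0], [0.0, 1.0]])
print(ensemble_diversification_loss(aligned))     # 1.0
print(ensemble_diversification_loss(orthogonal))  # 0.0
```

Minimizing this term pushes the ensemble members' gradients apart, which raises the Q-value variance (and hence the clipped-Q penalty) on near-distribution OOD actions without needing a huge ensemble.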
Env & Baselines
D4RL Mujoco and Adroit
SAC, CQL, REM, BC
Exp
- Evaluation on D4RL MuJoCo Gym tasks (vs. baselines) $\to$ EDAC outperforms them:
  - The performance gap is especially large on the random, medium, and medium-replay datasets.
  - EDAC needs a much smaller Q-ensemble than SAC-N.
  - EDAC chooses from a more diverse range of actions than CQL.
- Evaluation on D4RL Adroit tasks (baselines) $\to$ Better performance.
- Computational cost comparison (vs. SAC and CQL) $\to$ EDAC runs faster than CQL with comparable memory consumption.
https://jasonzhujp.github.io/2023/04/11/paper-rl-07/