[Paper Notes] EDAC

Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (NeurIPS 2021)

Problem

  • Prior methods typically require an accurate estimate of the behavior policy or explicit sampling of OOD data points, both of which are difficult.
  • Prior methods under-utilize the generalization ability of the value function and often fall into suboptimal solutions that stay too close to the given dataset (i.e., they are overly conservative).

Idea

Propose an uncertainty-based, ensemble-diversified actor-critic offline RL method with clipped Q-learning, which leverages the confidence of the Q-value predictions and requires neither estimation of nor sampling from the data distribution.

Observation

  • Clipped Q-learning can be leveraged to penalize OOD data points with high prediction uncertainty: the size of the penalty is highly correlated with the standard deviation of the ensemble's Q-estimates (see the sketch after this list).
  • Increasing the number of Q-networks used with clipped Q-learning further improves performance.
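A minimal PyTorch-style sketch of the clipped target over an N-ensemble; the function and variable names (`clipped_q_target`, `target_q_nets`, etc.) are illustrative, not from the paper's code:

```python
import torch

def clipped_q_target(target_q_nets, reward, next_obs, next_action,
                     next_log_prob, done, gamma=0.99, alpha=0.2):
    """SAC-N-style clipped target: take the minimum over all N critics.

    Where the critics disagree (e.g., on OOD actions), the minimum falls
    well below the ensemble mean, acting as an uncertainty penalty.
    """
    with torch.no_grad():
        # Stack the N target critics' predictions: shape (N, batch)
        qs = torch.stack([q(next_obs, next_action).squeeze(-1)
                          for q in target_q_nets])
        min_q = qs.min(dim=0).values  # clipped Q-value
        # Standard entropy-regularized SAC backup
        return reward + gamma * (1.0 - done) * (min_q - alpha * next_log_prob)
```

The gap between the ensemble mean and this minimum grows with the ensemble's disagreement, which is exactly what penalizes OOD actions.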

Theory

  • Clipped Q-learning can be interpreted either as penalizing state-action pairs with high-variance Q-value estimates, or as using the lower confidence bound of the Q-value predictions. Theoretically, taking the clipped (minimum) Q-value is similar to penalizing the ensemble mean of the Q-values by the standard deviation scaled by a coefficient that depends on N (see the approximation after this list).
  • The alignment of the critics' input gradients can cause insufficient penalization of near-distribution data points, which is why a large number of ensemble networks is required. Theoretically, the variance of the Q-values for an OOD action along $w_{\min}$ is upper-bounded by a constant multiple of $\epsilon$.
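A sketch of the approximation behind the first bullet, assuming the N ensemble estimates at a fixed $(s,a)$ are i.i.d. Gaussian with mean $m(s,a)$ and standard deviation $\sigma(s,a)$ (a standard expected-minimum approximation; treat the exact constant as indicative):

$$
\mathbb{E}\left[\min_{j=1,\dots,N} Q_j(s,a)\right] \approx m(s,a) - \Phi^{-1}\!\left(\frac{N - \frac{\pi}{8}}{N - \frac{\pi}{4} + 1}\right)\sigma(s,a)
$$

where $\Phi^{-1}$ is the standard normal quantile function. The coefficient grows with N, so a larger ensemble imposes a stronger penalty on high-uncertainty (OOD) actions.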

Algo

Built on SAC(-N); EDAC additionally adds an ensemble-similarity (ES) diversification objective on the action-gradients of the Q-value functions, so that the critics disagree more outside the data distribution (sketched below).
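A rough PyTorch sketch of such a diversification term, assuming `q_nets` is a list of N critic networks; the normalization constant here is illustrative and only approximates the paper's ES objective:

```python
import torch
import torch.nn.functional as F

def ensemble_diversity_loss(q_nets, obs, action):
    """Sketch of a gradient-diversification term: penalize the pairwise
    cosine similarity between the critics' action-gradients, so the
    ensemble's variance stays high near the dataset's actions."""
    action = action.detach().requires_grad_(True)
    grads = []
    for q in q_nets:
        # create_graph=True so the critic update can differentiate
        # through this gradient penalty
        g, = torch.autograd.grad(q(obs, action).sum(), action,
                                 create_graph=True)
        grads.append(F.normalize(g, dim=-1))
    grads = torch.stack(grads)                        # (N, batch, act_dim)
    sim = torch.einsum('ibd,jbd->bij', grads, grads)  # pairwise inner products
    n = len(q_nets)
    off_diag = 1.0 - torch.eye(n, device=sim.device)  # drop i == j terms
    # Mean cosine similarity over the N(N-1) off-diagonal pairs and the batch
    return (sim * off_diag).sum(dim=(1, 2)).mean() / (n * (n - 1))
```

The full critic loss would then be the SAC-N Bellman loss plus $\eta$ times this term, with $\eta$ tuned per task.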

Env & Baselines

Environments: D4RL MuJoCo Gym and Adroit tasks
Baselines: SAC, CQL, REM, BC

Exp

  1. Evaluation on D4RL MuJoCo Gym tasks (vs. baselines) $\to$
    • EDAC's gain over the baselines is especially large on the random, medium, and medium-replay datasets.
    • EDAC uses a much smaller Q-ensemble size than SAC-N.
    • EDAC chooses from a more diverse range of actions than CQL.
  2. Evaluation on D4RL Adroit tasks (vs. baselines) $\to$ EDAC performs better.
  3. Computational cost comparison (vs. SAC and CQL) $\to$ EDAC runs faster than CQL with comparable memory consumption.
