[Paper Notes] CQL

Conservative Q-Learning for Offline Reinforcement Learning (NeurIPS 2020)

Problem

In practice, standard off-policy RL methods can fail due to overestimation induced by distributional shift between the dataset and the learned policy.

Idea

Learn a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. This is achieved by additionally minimizing the expected Q-value under a particular distribution of $\left(s,a\right)$ pairs; the bound is then tightened by introducing an additional Q-value maximization term under the data distribution.
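
For reference, here is my transcription of the two training objectives referred to below as Equation 1 and Equation 2 (check the paper for the exact statement); $\hat{\mathcal{B}}^{\pi}$ denotes the empirical Bellman backup operator, $\mu$ the chosen action distribution, $\hat{\pi}_{\beta}$ the empirical behavior policy, and $\alpha \ge 0$ the conservatism weight:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\, \mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(a\mid s)}\big[Q(s,a)\big] + \frac{1}{2}\, \mathbb{E}_{s,a\sim\mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\big)^{2}\Big] \quad \text{(Equation 1)}$$

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\,\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(a\mid s)}\big[Q(s,a)\big] - \mathbb{E}_{s\sim\mathcal{D},\,a\sim\hat{\pi}_{\beta}(a\mid s)}\big[Q(s,a)\big]\Big) + \frac{1}{2}\, \mathbb{E}_{s,a\sim\mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\big)^{2}\Big] \quad \text{(Equation 2)}$$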

Theory

  • The conservative Q-function learned by iterating Equation 1 lower-bounds the true Q-function.
  • Equation 2 lower-bounds the expected value under the policy $\pi$ when $\mu = \pi$, although it no longer lower-bounds the Q-value estimates pointwise; this is what yields the tighter lower bound.
  • The CQL variant $CQL\left(\mathcal{H}\right)$ learns Q-value estimates that lower-bound the actual Q-function under the action distribution defined by the policy $\pi^k$, under mild regularity conditions (its objective is sketched after this list).
  • Safe Policy Improvement Guarantees: CQL optimizes a well-defined, penalized empirical RL objective, and performs high-confidence safe policy improvement over the behavior policy.
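
As a reminder of what $CQL\left(\mathcal{H}\right)$ optimizes (my transcription from memory, so verify against the paper): it instantiates the general CQL family with an entropy regularizer on $\mu$, which turns the penalty into a log-sum-exp over actions:

$$\min_{Q}\; \alpha\, \mathbb{E}_{s\sim\mathcal{D}}\Big[\log\sum_{a}\exp\big(Q(s,a)\big) - \mathbb{E}_{a\sim\hat{\pi}_{\beta}(a\mid s)}\big[Q(s,a)\big]\Big] + \frac{1}{2}\, \mathbb{E}_{s,a,s'\sim\mathcal{D}}\Big[\big(Q - \hat{\mathcal{B}}^{\pi_{k}}\hat{Q}^{k}\big)^{2}\Big]$$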

Algo

Propose an actor-critic variant and a Q-learning variant based on CQL.
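
As a concrete illustration of the Q-learning (discrete-action) variant, here is a minimal sketch of a DQN-style TD loss with the CQL(H) penalty added. This is not the authors' reference code; names such as `q_net`, `target_net`, and `alpha` are placeholders of my own.

```python
# Minimal sketch of a discrete-action CQL(H) loss on top of a DQN-style update.
# Assumes q_net(s) and target_net(s) return a (batch, num_actions) tensor of Q-values.
import torch
import torch.nn.functional as F


def cql_h_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    """TD loss plus the CQL(H) penalty: logsumexp_a Q(s, a) - Q(s, a_data)."""
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    q_all = q_net(s)                                      # (B, |A|)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for dataset actions

    # Standard bootstrapped target computed with the target network.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_data, target)

    # Conservative penalty: softly push down Q on all actions via logsumexp,
    # push up Q on the actions actually present in the dataset.
    cql_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return td_loss + alpha * cql_penalty
```

Roughly speaking, the actor-critic variant applies the same penalty to a SAC-style critic update and alternates it with the usual policy improvement step.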

Env & Baselines

D4RL (Gym, Adroit, AntMaze, Kitchen), offline Atari games
Continuous: BEAR, BRAC, SAC (adapted to the offline setting), BC
Discrete: REM, QR-DQN

Exp

  1. Gym domains: CQL(H) outperforms prior SOTA by a small margin on datasets generated by a single policy, and by a large margin on datasets generated by multiple (mixed) policies.
  2. Adroit, AntMaze and Kitchen: CQL variants are the only methods that improve over BC.
  3. Offline Atari games: as the percentage of top samples in the dataset decreases, CQL's advantage over the other discrete baselines becomes more significant.
  4. Analysis of CQL: Perform empirical evaluation to verify that CQL indeed lower-bounds the value function.
