[Paper Notes] CQL
Conservative Q-Learning for Offline Reinforcement Learning, NeurIPS 2020
Problem
In practice, standard off-policy RL methods can fail due to overestimation induced by distributional shift between the dataset and the learned policy.
Idea
Learn a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. This is done by additionally minimizing the expected Q-value under a particular distribution of $\left(s,a\right)$ pairs; the bound is then tightened by introducing an extra Q-value maximization term under the data distribution.
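For reference, the two iterative objectives referred to below as Equations 1 and 2 can be written as follows (notation as in the paper: $\mu$ is the penalized action distribution, $\hat{\pi}_\beta$ the behavior policy, and $\hat{\mathcal{B}}^{\pi}$ the empirical Bellman operator):

```latex
% Equation 1: minimize Q under \mu on top of the standard Bellman error
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;
\alpha\, \mathbb{E}_{s\sim\mathcal{D},\, a\sim\mu(a\mid s)}\!\left[Q(s,a)\right]
+ \tfrac{1}{2}\, \mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\left(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\right)^{2}\right]

% Equation 2: additionally maximize Q under the data (behavior) distribution
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;
\alpha\left(\mathbb{E}_{s\sim\mathcal{D},\, a\sim\mu(a\mid s)}\!\left[Q(s,a)\right]
- \mathbb{E}_{s\sim\mathcal{D},\, a\sim\hat{\pi}_\beta(a\mid s)}\!\left[Q(s,a)\right]\right)
+ \tfrac{1}{2}\, \mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\left(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\right)^{2}\right]
```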
Theory
- The conservative Q-function learned by iterating Equation 1 lower-bounds the true Q-function.
- Equation 2 lower-bounds the expected value under the policy $\pi$ when $\mu = \pi$, but it does not lower-bound the Q-value estimates pointwise (which is what yields the tighter bound).
- The variant of CQL, $CQL\left(\mathcal{H}\right)$ , learns Q-value estimates that lower-bound the actual Q-function under the action-distribution defined by the policy $\pi^k$, under mild regularity conditions.
- Safe Policy Improvement Guarantees: CQL optimizes a well-defined, penalized empirical RL objective, and performs high-confidence safe policy improvement over the behavior policy.
Algo
Propose an actor-critic variant and a Q-learning variant based on CQL.
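As a concrete illustration, the conservative penalty of the discrete-action CQL(H) variant (log-sum-exp of Q-values minus the Q-value of the dataset action) can be sketched as below. This is a minimal NumPy sketch, not the authors' implementation; the function name and `alpha` argument are my own choices:

```python
import numpy as np

def cql_h_penalty(q_values, data_actions, alpha=1.0):
    """Sketch of the CQL(H) conservative penalty for discrete actions.

    q_values:     (batch, num_actions) array of Q-value estimates
    data_actions: (batch,) array of actions observed in the dataset
    Computes alpha * E_s[ logsumexp_a Q(s, a) - Q(s, a_data) ],
    which is added to the standard TD error in the full objective.
    """
    # Soft maximum over actions (log-sum-exp along the action axis)
    lse = np.log(np.sum(np.exp(q_values), axis=1))
    # Q-values of the actions actually taken in the dataset
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    # Penalty is nonnegative, since logsumexp >= max >= any single entry
    return alpha * np.mean(lse - q_data)
```

Minimizing this term pushes down Q-values on out-of-distribution actions while pushing up Q-values on dataset actions, which is what makes the learned Q-function conservative.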
Env & Baselines
D4RL(Gym, Adroit), Atari games
Continuous: BEAR, BRAC, SAC (adapted to the offline setting), BC
Discrete: REM, QR-DQN
Exp
- Gym domains: CQL(H) beats the SOTA by a small margin on datasets generated by a single policy, while it largely outperforms the SOTA on datasets mixed from multiple policies.
- Adroit, AntMaze and Kitchen: CQL variants are the only methods that improve over BC.
- Offline Atari games: As the percentage of top (high-return) samples in the dataset decreases, the advantage over the other discrete-action baselines becomes more significant.
- Analysis of CQL: Perform empirical evaluation to verify that CQL indeed lower-bounds the value function.