[Paper Notes] CQL
Conservative Q-Learning for Offline Reinforcement Learning, NeurIPS 2020
Problem
In practice, standard off-policy RL methods can fail due to overestimation induced by distributional shift between the dataset and the learned policy.
Idea
Learn a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. This is done by additionally minimizing the expected Q-value under a particular distribution of $\left(s,a\right)$ pairs; the bound is then tightened by introducing an extra Q-value maximization term under the data distribution.
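For reference, the two iterative objectives referred to below as Equations 1 and 2 can be written as follows (notation as in the paper: $\mu$ is the penalized action distribution, $\hat{\pi}_\beta$ the behavior policy, and $\hat{\mathcal{B}}^{\pi}$ the empirical Bellman operator):

```latex
% Equation 1: minimize Q under \mu on top of the standard Bellman error
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;
\alpha\, \mathbb{E}_{s\sim\mathcal{D},\, a\sim\mu(a\mid s)}\!\left[Q(s,a)\right]
+ \tfrac{1}{2}\, \mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\left(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\right)^{2}\right]

% Equation 2: additionally maximize Q under the data (behavior) distribution
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;
\alpha\left(\mathbb{E}_{s\sim\mathcal{D},\, a\sim\mu(a\mid s)}\!\left[Q(s,a)\right]
- \mathbb{E}_{s\sim\mathcal{D},\, a\sim\hat{\pi}_\beta(a\mid s)}\!\left[Q(s,a)\right]\right)
+ \tfrac{1}{2}\, \mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\left(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\right)^{2}\right]
```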
Theory
- The conservative Q-function learned by iterating Equation 1 lower-bounds the true Q-function.
- Equation 2 lower-bounds the expected value under the policy $\pi$ when $\mu = \pi$, but it does not lower-bound the Q-value estimates pointwise (which is what yields the tighter bound).
- The variant of CQL, $CQL\left(\mathcal{H}\right)$ , learns Q-value estimates that lower-bound the actual Q-function under the action-distribution defined by the policy $\pi^k$, under mild regularity conditions.
- Safe Policy Improvement Guarantees: CQL optimizes a well-defined, penalized empirical RL objective, and performs high-confidence safe policy improvement over the behavior policy.
Algo
Propose an actor-critic variant and a Q-learning variant based on CQL.
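As a concrete illustration, the conservative penalty of the discrete-action CQL(H) variant (log-sum-exp of Q-values minus the Q-value of the dataset action) can be sketched as below. This is a minimal NumPy sketch, not the authors' implementation; the function name and `alpha` argument are my own choices:

```python
import numpy as np

def cql_h_penalty(q_values, data_actions, alpha=1.0):
    """Sketch of the CQL(H) conservative penalty for discrete actions.

    q_values:     (batch, num_actions) array of Q-value estimates
    data_actions: (batch,) array of actions observed in the dataset
    Computes alpha * E_s[ logsumexp_a Q(s, a) - Q(s, a_data) ],
    which is added to the standard TD error in the full objective.
    """
    # Soft maximum over actions (log-sum-exp along the action axis)
    lse = np.log(np.sum(np.exp(q_values), axis=1))
    # Q-values of the actions actually taken in the dataset
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    # Penalty is nonnegative, since logsumexp >= max >= any single entry
    return alpha * np.mean(lse - q_data)
```

Minimizing this term pushes down Q-values on out-of-distribution actions while pushing up Q-values on dataset actions, which is what makes the learned Q-function conservative.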
Env & Baselines
D4RL(Gym, Adroit), Atari games
Continuous: BEAR, BRAC, SAC (adapted to the offline setting), BC
Discrete: REM, QR-DQN
Exp
- Gym domains: CQL(H) beats the SOTA by a small margin on datasets generated by a single policy, while it largely outperforms the SOTA on datasets mixed from multiple policies.
- Adroit, AntMaze and Kitchen: CQL variants are the only methods that improve over BC.
- Offline Atari games: As the percentage of top (high-return) samples in the dataset decreases, the advantage over the other discrete-action baselines becomes more significant.
- Analysis of CQL: Perform empirical evaluation to verify that CQL indeed lower-bounds the value function.