[Paper Notes] BCQ

Off-Policy Deep Reinforcement Learning without Exploration (ICML 2019)

Problem

Due to errors introduced by extrapolation, standard off-policy deep RL algorithms are incapable of learning without data correlated to the distribution under the current policy.

Observation

Off-policy agents perform dramatically worse than the behavioral agent even when trained with the same algorithm on the same dataset. Performance deteriorates rapidly once the data is uncorrelated with the current policy, and the value estimate produced by the deep Q-network diverges.

Idea

Agents are trained to maximize reward while minimizing the mismatch between the state-action visitation of the policy and the state-action pairs contained in the batch. BCQ uses a state-conditioned generative model to produce only previously seen actions.

Theory

  • The cause of extrapolation error can be attributed to three sources: absent data, model bias, and training mismatch.
  • By inducing a data distribution that is contained entirely within the batch, batch-constrained policies can eliminate extrapolation error entirely for deterministic MDPs.
  • A batch-constrained variant of Q-learning converges to the optimal policy under the same conditions as the standard form of Q-learning (a sketch of the tabular update follows this list).
  • For a deterministic MDP, batch-constrained Q-learning is guaranteed to match or outperform the behavioral policy when starting from any state contained in the batch.

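As a concrete illustration of the last three bullets, here is a minimal sketch of the tabular batch-constrained Q-learning update, where the maximization over next actions is restricted to pairs that actually appear in the batch $\mathcal{B}$ (the notation is my own shorthand on top of the standard Q-learning update, not copied from the paper):

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \Big( r + \gamma \, Q\big(s', \underset{a' :\, (s', a') \in \mathcal{B}}{\arg\max} \, Q(s', a')\big) \Big)$$

Because the update never queries a state-action pair outside $\mathcal{B}$, extrapolation error cannot enter the value estimate, which is the sense in which it is eliminated for deterministic MDPs.
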
Algo

  • State-conditioned generative model: given the current state $s$, sample $n$ candidate actions that resemble the actions stored in the batch $\mathcal{B}$; implemented as a conditional VAE (CVAE).

  • Perturbation model: increases the diversity of the sampled actions. It outputs an adjustment $\xi_\phi(s, a, \Phi)$ to an action $a$ in the range $\left[ -\Phi,\Phi \right]$. The policy can be written as:

    $$\pi(s) = \underset{a_i + \xi_\phi(s, a_i, \Phi)}{\arg\max} \; Q_\theta\big(s, a_i + \xi_\phi(s, a_i, \Phi)\big), \qquad \{a_i \sim G_\omega(s)\}_{i=1}^{n}$$

    The generative model together with the perturbation model can be viewed as the "actor" (a sketch of the resulting action selection follows this list), and the perturbation model is updated via gradient ascent:

    $$\phi \leftarrow \underset{\phi}{\arg\max} \sum_{s \in \mathcal{B}} Q_\theta\big(s, a + \xi_\phi(s, a, \Phi)\big), \qquad a \sim G_\omega(s)$$

  • Variant of clipped double Q-learning: penalize high-variance value estimates in regions of uncertainty and push the policy towards states contained in the batch. Take a convex combination of the two target critics' values, with a higher weight $\lambda$ on the minimum, to form the target Q-value (a sketch of this computation also follows the list):

    $$y = r + \gamma \max_{a_i} \Big[ \lambda \min_{j = 1, 2} Q_{\theta_j'}(s', a_i) + (1 - \lambda) \max_{j = 1, 2} Q_{\theta_j'}(s', a_i) \Big]$$

    where the outer maximum ranges over the $n$ perturbed candidate actions sampled for $s'$.

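A minimal PyTorch-style sketch of BCQ's action selection under the first two bullets, assuming a trained CVAE with a `decode(state)` method playing the role of $G_\omega$, a bounded perturbation network `perturb(state, action)` for $\xi_\phi$, and a critic `q1`; all names and shapes are illustrative assumptions, not the authors' reference implementation.

```python
import torch


@torch.no_grad()
def select_action(state, vae, perturb, q1, n=10):
    """Sample n candidate actions from the generative model, perturb each
    within [-Phi, Phi] (the clamp is assumed to live inside `perturb`),
    and return the candidate with the highest Q-value."""
    # Duplicate the state n times so every copy receives its own sample.
    state = torch.as_tensor(state, dtype=torch.float32).reshape(1, -1).repeat(n, 1)
    actions = vae.decode(state)                  # a_i ~ G_w(s), i = 1..n
    actions = actions + perturb(state, actions)  # a_i + xi_phi(s, a_i, Phi)
    q_values = q1(state, actions)                # shape (n, 1)
    best = q_values.argmax()                     # index of the best candidate
    return actions[best].cpu().numpy()
```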

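In the same hypothetical notation, a sketch of the soft clipped double-Q target from the last bullet, assuming two target critics `q1_t`, `q2_t`, a target perturbation network `perturb_t`, and `lam` for $\lambda$ (values above 0.5 put more weight on the minimum):

```python
import torch


@torch.no_grad()
def compute_target(next_state, reward, not_done, vae, perturb_t, q1_t, q2_t,
                   n=10, gamma=0.99, lam=0.75):
    """Evaluate n perturbed candidate actions per next state under both target
    critics, blend min and max with weight lam on the min, take the best
    candidate, then bootstrap the Bellman target."""
    batch_size = next_state.shape[0]
    # Repeat every next state n times -> (batch_size * n, state_dim).
    s = next_state.repeat_interleave(n, dim=0)
    a = vae.decode(s)
    a = a + perturb_t(s, a)                      # perturbed candidates
    q1, q2 = q1_t(s, a), q2_t(s, a)              # each (batch_size * n, 1)
    # Convex combination with the higher weight on the minimum.
    q = lam * torch.min(q1, q2) + (1.0 - lam) * torch.max(q1, q2)
    # Maximize over the n candidates belonging to each next state.
    q = q.reshape(batch_size, n).max(dim=1, keepdim=True).values
    return reward + not_done * gamma * q
```
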
Env & Baselines

MuJoCo
DDPG, DQN, BC, VAE-BC

Exp

  1. Comparison with baselines $\to$ BCQ succeeds at all tasks, matching or outperforming the behavioral policy in each instance with a single set of fixed hyper-parameters.
  2. Performance with imperfect demonstrations $\to$ BCQ strongly outperforms the noisy demonstrator, disentangling poor and expert actions in remarkably few iterations.
