【Paper Notes】DOGE

When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning (ICLR 2023)

Problem

Existing offline RL methods are often over-conservative, which inevitably hurts generalization outside the data distribution.

Observation

  • Existing offline RL methods (TD3+BC, CQL, IQL) fail when only small areas of data are removed from critical regions.
  • Deep Q functions interpolate well but struggle to extrapolate.

Idea

DOGE marries dataset geometry, characterized by a state-conditioned distance function, with deep function approximators in offline RL. It enables exploitation in generalizable OOD areas rather than strictly constraining the policy within the data distribution.

Theory

  • Explain why deep Q functions, under neural tangent kernel (NTK) assumptions, interpolate well inside or near the boundaries of the convex hull formed by the dataset.
  • Properties of the optimal state-conditioned distance function $g^{\ast}\left(s,a\right)$.
    • The distance function is convex w.r.t. actions and is an upper bound of the distance to the state-conditioned centroid $a_{o}\left(s\right)$ of the training dataset (see the check after this list).
    • The negative gradient of the distance function at an extrapolated action $\hat{a}$, $-\nabla_{\hat{a}}g^{\ast}\left(s,\hat{a}\right)$, points inside the convex hull of the dataset.

  • Define the Bellman-consistent coefficient and the constrained policy set, then present an upper bound on the coefficient and the performance bound of DOGE.
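
A quick check of the two distance-function properties, assuming (as the regression in the Algo section suggests) that the optimal distance function takes the form $g^{\ast}\left(s,\hat{a}\right)=\mathbb{E}_{a\sim D\left(\cdot\mid s\right)}\left[\left\Vert \hat{a}-a\right\Vert \right]$: each $\left\Vert \hat{a}-a\right\Vert$ is convex in $\hat{a}$ and expectations preserve convexity, while Jensen's inequality gives the centroid bound
$$
g^{\ast}\left(s,\hat{a}\right)=\mathbb{E}_{a\sim D\left(\cdot\mid s\right)}\left[\left\Vert \hat{a}-a\right\Vert \right]\ge\left\Vert \hat{a}-\mathbb{E}_{a\sim D\left(\cdot\mid s\right)}\left[a\right]\right\Vert =\left\Vert \hat{a}-a_{o}\left(s\right)\right\Vert .
$$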

Algo

Learn the state-conditioned distance function $g\left(s, a\right)$ by regression, with $\left(s, a\right) \sim D$ and synthetic noise actions $\hat{a}$ sampled from the uniform distribution over the full action space $\mathcal{A}$.
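
A minimal PyTorch-style sketch of one plausible form of this regression, where $g\left(s,\hat{a}\right)$ is trained to predict $\left\Vert \hat{a}-a\right\Vert _{2}$ (consistent with the centroid upper bound above); the network architecture, action bounds, and all names here are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class DistanceNet(nn.Module):
    """Hypothetical state-conditioned distance network g(s, a) -> scalar."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def distance_loss(g, s, a, action_low=-1.0, action_high=1.0):
    """One regression step on a batch (s, a) ~ D: sample noise actions a_hat
    uniformly over the (assumed box-shaped) action space and regress
    g(s, a_hat) onto the distance ||a_hat - a||_2."""
    a_hat = torch.empty_like(a).uniform_(action_low, action_high)
    target = torch.norm(a_hat - a, dim=-1)        # distance to the dataset action
    return ((g(s, a_hat) - target) ** 2).mean()   # L2 regression over the batch
```

At the minimum of this L2 regression, $g\left(s,\hat{a}\right)$ approaches $\mathbb{E}_{a\sim D\left(\cdot\mid s\right)}\left[\left\Vert \hat{a}-a\right\Vert \right]$, which is exactly the form used in the Jensen bound above.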


Build DOGE on top of TD3 by simply plugging the state-conditioned distance function into the policy objective as a regularization term.
$\lambda$ is the Lagrangian multiplier, which is auto-adjusted via dual gradient descent. $Q$ values are rescaled by $\beta=\frac{\alpha}{\frac{1}{n}{\textstyle \sum_{i=1}^{n}}\left|Q\left(s_{i},a_{i}\right)\right|}$ to balance $Q$-function maximization and policy constraint satisfaction.
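
A sketch of how the distance term could be plugged into a TD3-style actor update together with the dual update of $\lambda$; the constraint threshold `eps`, the log-parameterization of $\lambda$, and the optimizer setup are illustrative assumptions:

```python
def actor_and_dual_step(actor, critic, g, log_lam, actor_opt, lam_opt,
                        s, a, alpha=2.5, eps=0.1):
    """One policy step: maximize rescaled Q while penalizing the learned distance
    g(s, pi(s)), then adjust lambda by dual gradient descent.
    log_lam: trainable scalar, e.g. torch.zeros(1, requires_grad=True)."""
    pi = actor(s)
    lam = log_lam.exp().detach()                   # lambda >= 0 via log-parameterization

    # TD3+BC-style rescaling: beta = alpha / mean |Q(s_i, a_i)| over the batch.
    beta = alpha / critic(s, a).abs().mean().detach()

    # Actor objective: maximize beta * Q(s, pi(s)) - lambda * g(s, pi(s)).
    actor_loss = -(beta * critic(s, pi) - lam * g(s, pi)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Dual step: push lambda up when the constraint g(s, pi(s)) <= eps is violated,
    # and down when it is satisfied.
    violation = (g(s, actor(s)) - eps).mean().detach()
    lam_loss = -log_lam.exp() * violation
    lam_opt.zero_grad()
    lam_loss.backward()
    lam_opt.step()
```

Keeping $\lambda$ detached in the actor loss and the violation detached in the dual loss mirrors the alternating primal/dual updates of dual gradient descent.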

Env & Baselines

D4RL (MuJoCo, AntMaze)
TD3+BC, CQL, IQL, BCQ, BEAR, BC

Exp

  1. Comparison with SOTA [all baselines] $\to$ DOGE is the first policy constraint method to successfully solve the AntMaze-medium and AntMaze-large tasks.
  2. Evaluation of generalization [TD3+BC, CQL, IQL] (remove small areas of data from the critical pathways to the destination in AntMaze) $\to$ All these methods struggle to generalize to the missing areas and suffer a severe performance drop, while DOGE maintains competitive performance.
  3. Ablation study (hyperparameters $\alpha$, $G$ and $N$) $\to$ $G$ has a more significant impact on performance; a more tolerant $G$ achieves relatively good performance.
