[Paper Notes] DOGE
When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning (ICLR 2023)
Problem
Existing offline RL methods are often over-conservative, which inevitably hurts generalization outside the data distribution.
Observation
- Existing offline RL methods (TD3+BC, CQL, IQL) fail when only small areas of data in critical regions are removed.
- Deep Q functions interpolate well but struggle to extrapolate.
Idea
DOGE marries dataset geometry, characterized by a state-conditioned distance function, with deep function approximators in offline RL. It enables exploitation in generalizable OOD areas rather than strictly constraining the policy within the data distribution.
Theory
- Explain why, under neural tangent kernel assumptions, deep Q functions interpolate well inside or near the boundaries of the convex hull formed by the dataset.
- Properties of the optimal state-conditioned distance function $g^{\ast}\left(s,a\right)$.
- The distance function is convex w.r.t. actions and upper-bounds the distance to the state-conditioned centroid $a_{o}\left(s\right)$ of the training dataset.
- The negative gradient of the distance function at an extrapolated action $\hat{a}$, $-\nabla_{\hat{a}}g^{\ast}\left(s,\hat{a}\right)$, points inside the convex hull of the dataset.
- Define a Bellman-consistent coefficient and a constrained policy set, then derive an upper bound on the coefficient and the performance bound of DOGE.
Algo
Learn the state-conditioned distance function $g\left(s, a\right)$ by solving a regression problem, with $\left(s, a\right) \sim D$ and synthetic noise actions $\hat{a}$ sampled uniformly over the full action space $\mathcal{A}$.
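The regression step above can be sketched as follows. This is a minimal PyTorch sketch, not the paper's implementation: I assume the regression target for a noise action $\hat{a}$ is its Euclidean distance to the paired dataset action $a$, and that $N$ noise actions are drawn per dataset pair from a uniform distribution over $[-1, 1]^{d}$; the network sizes and names (`DistanceNet`, `N`) are illustrative.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N = 4, 2, 8  # illustrative dimensions

class DistanceNet(nn.Module):
    """State-conditioned distance function g(s, a) -> scalar."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

g = DistanceNet()
opt = torch.optim.Adam(g.parameters(), lr=3e-4)

# Dummy batch standing in for (s, a) ~ D.
s = torch.randn(32, STATE_DIM)
a = torch.rand(32, ACTION_DIM) * 2 - 1

# Repeat each pair N times, sample noise actions uniformly over A = [-1, 1]^d,
# and regress g(s, a_hat) toward the distance to the dataset action.
s_rep = s.repeat_interleave(N, dim=0)
a_rep = a.repeat_interleave(N, dim=0)
a_hat = torch.rand_like(a_rep) * 2 - 1
target = (a_hat - a_rep).norm(dim=-1)           # assumed Euclidean target
loss = ((g(s_rep, a_hat) - target) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

One step of this loop shrinks the squared regression error; in practice the loop runs over the whole dataset until $g$ is a usable distance surrogate.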
Build DOGE on top of TD3 by simply plugging the state-conditioned distance function into the policy objective as a regularization term during policy training.
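A hedged sketch of that policy update: I assume the constraint $g(s, \pi(s)) \le G$ is enforced via a Lagrange multiplier $\alpha$ updated by dual gradient ascent (matching the hyperparameters $\alpha$ and $G$ in the ablation); the networks and the threshold value are placeholders, and the pretrained distance function is treated as fixed.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 2
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
dist = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                     nn.Linear(64, 1))  # stand-in for the pretrained g

def q(s, a):
    return critic(torch.cat([s, a], dim=-1)).squeeze(-1)

def g(s, a):
    return dist(torch.cat([s, a], dim=-1)).squeeze(-1)

G = 0.5                                             # assumed threshold
log_alpha = torch.zeros(1, requires_grad=True)      # dual variable
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
alpha_opt = torch.optim.Adam([log_alpha], lr=1e-3)

s = torch.randn(32, STATE_DIM)                      # dummy state batch
pi = actor(s)
alpha = log_alpha.exp().detach()

# Policy step: maximize Q while penalizing distance beyond the data geometry.
actor_loss = (-q(s, pi) + alpha * g(s, pi)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Dual ascent: alpha grows when the constraint g(s, pi(s)) <= G is violated.
violation = (g(s, actor(s)).detach() - G).mean()
alpha_loss = -(log_alpha.exp() * violation)
alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()
```

Compared with the fixed behavior-cloning weight in TD3+BC, the dual update lets the strength of the geometry constraint adapt to how far the policy strays from the dataset.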
Env & Baselines
D4RL(Mujoco, AntMaze)
TD3+BC, CQL, IQL, BCQ, BEAR, BC
Exp
- Comparison with SOTA [baseline] $\to$ DOGE is the first policy constraint method to successfully solve AntMaze-medium and AntMaze-large tasks.
- Evaluation of generalization [TD3+BC, CQL, IQL] (remove small areas of data from the critical pathways to the destination in AntMaze) $\to$ all these methods fail to generalize to the missing areas and suffer severe performance drops, while DOGE maintains competitive performance.
- Ablation study (hyperparameters $\alpha$, $G$, and $N$) $\to$ $G$ has the most significant impact on performance; a more tolerant $G$ achieves relatively good performance.