[Paper Notes] AMPL

A Unified Framework for Alternating Offline Model Training and Policy Learning (NeurIPS 2022)

Problem

In offline model-based RL, the learning objectives for the dynamics model and the policy are isolated: the model is fit to the offline data (typically by maximum likelihood), while the policy is optimized for return under the learned model. This objective mismatch may lead to inferior performance of the learned agent.

Idea

Maximize a lower bound of the true expected return by alternating between dynamics-model training and policy learning. Propose a fixed-point-style method for estimating the marginal importance weights (MIW) between the offline-data distribution and that of the current policy.
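A minimal sketch of how the alternation could be organized (the function names and the scheduling below are illustrative assumptions, not the paper's pseudocode; the MLE warm start and periodic model refresh follow the practical notes in the Algo section):

```python
# Hedged sketch of the alternating scheme. Every argument is a caller-supplied
# callable; these names are hypothetical placeholders, not the paper's API.

def train_ampl(
    offline_data,
    num_iters,
    model_update_period,
    init_model_by_mle,          # standard MLE warm start of the dynamics model
    init_miw, init_critic, init_policy,
    update_miw,                 # fixed-point update of the marginal importance weights
    update_model_weighted_mle,  # w(s, a)-weighted MLE model update (Algo step 1)
    update_critic,              # critic update with a conservative target (Algo step 3)
    update_policy,              # policy update under the learned model (Algo step 5)
):
    model = init_model_by_mle(offline_data)
    omega, critic, policy = init_miw(), init_critic(), init_policy()

    for it in range(num_iters):
        # refresh the MIW estimate for the current policy
        omega = update_miw(omega, offline_data, model, policy, critic)

        # periodically retrain the dynamics model with the w-weighted MLE loss
        if it % model_update_period == 0:
            model = update_model_weighted_mle(model, offline_data, omega)

        # critic and policy updates under the current model
        critic = update_critic(critic, offline_data, model, policy)
        policy = update_policy(policy, offline_data, model, critic, omega)

    return policy
```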

Theory

  1. Present a tractable upper bound on the model-based evaluation error, which in turn yields a lower bound on the true return $J(\pi,P^{\ast})$ (a sketch of the bound's shape is given after this list).
    We can fix $\pi$ and train the dynamics model $\hat{P}$ by minimizing $D_{\pi}\left(P^{\ast},\hat{P}\right)$ w.r.t. $\hat{P}$. Symmetrically, we can fix the dynamics model $\hat{P}$ and learn a policy $\pi$ that maximizes the lower bound of $J(\pi,P^{\ast})$.
  2. On a finite state-action space, if the current policy $\pi$ is close to the behavior policy $\pi_{b}$, then the fixed-point iteration for $\omega$ defined by the operator $\mathcal{T}$ converges geometrically.
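The bound has the familiar simulation-lemma shape; a hedged reconstruction (the exact constant and the precise definition of $D_{\pi}$ are given in the paper and omitted in these notes):

$$
J(\pi,P^{\ast}) \;\ge\; J(\pi,\hat{P}) \;-\; C\, D_{\pi}\!\left(P^{\ast},\hat{P}\right), \qquad
D_{\pi}\!\left(P^{\ast},\hat{P}\right) \;\approx\; \mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[\operatorname{KL}\!\left(P^{\ast}\left(\cdot\mid s,a\right)\,\big\|\,\hat{P}\left(\cdot\mid s,a\right)\right)\right],
$$

where $C$ depends on the discount factor and the reward bound. For a fixed $\pi$, minimizing $D_{\pi}$ w.r.t. $\hat{P}$ tightens the bound; for a fixed $\hat{P}$, maximizing the right-hand side w.r.t. $\pi$ maximizes a lower bound of the true return.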

Algo

  1. Dynamics model training: Expand the KL term in $D_{\pi}\left(P^{\ast},\hat{P}\right)$.
    If the policy $\pi$ is fixed, the term ① is a constant w.r.t. $\hat{P}$. Given the MIW $\omega$, we can then optimize $\hat{P}$ by minimizing the remaining term (Eq.(6)), which is an MLE objective weighted by $\omega\left(s,a\right)$.
    [Practical] Use an ensemble of Gaussian probabilistic networks for $\hat{P}\left(\cdot|s,a\right)$ and $\hat{r}\left(s,a\right)$. Initialize the dynamics model by standard MLE training, and periodically update it by minimizing Eq.(6) (a hedged sketch of this weighted-MLE update is given after this list).
  2. Marginal importance weight $\omega\left(s,a\right)$ training: a “Bellman equation” for $\omega\left(s',a'\right)$ (Eq.(9)) relates it to the offline-data distribution, the transition kernel, and the current policy.
    The update iterate defined by $\mathcal{T}$ enjoys the geometric convergence guarantee of Theory point 2 above.
    [Practical] Multiplying both sides of Eq.(9) by $Q^{\hat{P}}_{\pi}$ and summing over $\left(s',a'\right)$ yields a tractable objective (Eq.(10)); we optimize $\omega$ by minimizing the difference between the RHS and the LHS of Eq.(10) (a hedged reconstruction of this identity and objective is given after this list).

  3. Critic training: [Practical] Use a conservative target when computing the Bellman backup.

  4. Estimating $D_{\pi}\left(P^{\ast},\hat{P}\right)$: Use the dual representation of the Jensen–Shannon divergence to approximate $D_{\pi}\left(P^{\ast},\hat{P}\right)$.

    [Practical] Approximately minimize $D_{\pi}\left(P^{\ast},\hat{P}\right)$ with a GAN structure, i.e. a discriminator that distinguishes real transitions in the offline data from transitions generated by $\hat{P}$ (a hedged sketch is given after this list).

  5. Policy training: [Practical] Update the policy to maximize the critic value under the learned model, with a policy regularizer weighted by the MIW $\omega$ (cf. the ablation study below).
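A hedged sketch of the step-1 weighted-MLE update for a single Gaussian dynamics network (the paper uses an ensemble; the interfaces and names here are illustrative assumptions, not the paper's code):

```python
import torch

def weighted_mle_loss(model, batch, omega_net):
    """w(s, a)-weighted negative log-likelihood for one Gaussian dynamics network.

    Assumed (hypothetical) interfaces: `model(s, a)` returns the mean and
    log-variance of a Gaussian over the next state; `omega_net(s, a)` returns
    the estimated marginal importance weight w(s, a).
    """
    s, a, s_next = batch["obs"], batch["act"], batch["next_obs"]

    mean, log_var = model(s, a)
    inv_var = torch.exp(-log_var)
    # Gaussian negative log-likelihood per sample, up to an additive constant
    nll = 0.5 * (((s_next - mean) ** 2) * inv_var + log_var).sum(dim=-1)

    with torch.no_grad():                 # the MIW is held fixed during the model update
        w = omega_net(s, a).squeeze(-1)
        w = w / (w.mean() + 1e-8)         # batch self-normalization (a stabilization choice)

    return (w * nll).mean()               # w(s, a)-weighted MLE objective
```

In practice, a loss of this form would be applied to each ensemble member during the periodic model updates.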

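For step 2, the identity behind Eq.(9)/(10) has the standard DICE-style shape below (a hedged reconstruction, writing $\omega = d^{\pi}/d_{\mathcal{D}}$ for the ratio between the current policy's occupancy and the offline-data distribution $d_{\mathcal{D}}$, and $\mu_{0}$ for the initial-state distribution):

$$
\omega\left(s',a'\right) d_{\mathcal{D}}\left(s',a'\right) \;=\; (1-\gamma)\,\mu_{0}\left(s'\right)\pi\left(a'\mid s'\right) \;+\; \gamma \sum_{s,a} \omega\left(s,a\right) d_{\mathcal{D}}\left(s,a\right) P\left(s'\mid s,a\right)\pi\left(a'\mid s'\right).
$$

Multiplying both sides by $Q^{\hat{P}}_{\pi}\left(s',a'\right)$ and summing over $\left(s',a'\right)$ gives a tractable moment-matching condition,

$$
\mathbb{E}_{(s,a)\sim d_{\mathcal{D}}}\!\left[\omega\left(s,a\right) Q^{\hat{P}}_{\pi}\left(s,a\right)\right]
\;=\;
(1-\gamma)\,\mathbb{E}_{s_{0}\sim\mu_{0},\,a_{0}\sim\pi}\!\left[Q^{\hat{P}}_{\pi}\left(s_{0},a_{0}\right)\right]
\;+\;
\gamma\,\mathbb{E}_{(s,a,s')\sim d_{\mathcal{D}},\,a'\sim\pi\left(\cdot\mid s'\right)}\!\left[\omega\left(s,a\right) Q^{\hat{P}}_{\pi}\left(s',a'\right)\right],
$$

and $\omega$ is optimized by minimizing the (squared) difference between the two sides, with $Q^{\hat{P}}_{\pi}$ acting as the test function. Whether the transition kernel inside the expectations comes from the offline data or from the learned model $\hat{P}$ follows the paper's exact construction.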

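For step 4, a hedged sketch of the Jensen–Shannon dual realized as a GAN-style discriminator over transitions (architecture and interfaces are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionDiscriminator(nn.Module):
    """Classifies (s, a, s') tuples: offline-data transitions vs. model-generated ones."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1))   # logit of D(s, a, s')


def js_dual_estimate(disc, real_batch, fake_batch):
    """GAN-style dual objective: at the optimal discriminator,
    E_real[log sigmoid(D)] + E_fake[log(1 - sigmoid(D))] = 2*JS(real, fake) - log 4.
    The discriminator maximizes this quantity; the model side minimizes it.
    """
    real_logit = disc(*real_batch)   # (s, a, s') drawn from the offline dataset
    fake_logit = disc(*fake_batch)   # (s, a, s' sampled from the learned model)
    return F.logsigmoid(real_logit).mean() + F.logsigmoid(-fake_logit).mean()
```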
Env & Baselines

D4RL (MuJoCo, Maze2D, Adroit)
AlgaeDICE, OptiDICE, CQL, FisherBRC, TD3+BC, MOPO, COMBO, WMOPO

Exp

  • Continuous-control results $\to$ Compared with the DICE-based methods, AMPL shows better performance, indicating that maximizing a lower bound of the true expected return can be a more effective framework than explicit stationary-distribution correction, since maximizing the lower bound is more directly tied to RL’s goal of maximizing policy performance.
  • Ablation Study $\to$
    • The effect of MIW-weighted model (re)training. $\to$ Ablation variant: train the model only once at the beginning with standard MLE (no $\omega$-weighted retraining).
    • The effect of MIW estimation. $\to$ Compare the proposed fixed-point estimator against the Variational Power Method (VPM), GenDICE (GENeralized stationary DIstribution Correction Estimation), and DualDICE (Dual stationary DIstribution Correction Estimation).
    • The effect of the weighted regularizer for policy learning. $\to$ The policy regularizer is weighted by the MIW $\omega$.
