[Paper Notes] MAPLE

Offline Model-based Adaptable Policy Learning (NeurIPS 2021)

Problem

Offline RL methods that constrain the learned policy to the in-support regions of the offline dataset can limit the potential of the resulting policies.

Idea

Model the OOD regions by constructing a set that covers the possible transition patterns, and adopt a probe-reduce paradigm for decision-making in OOD regions.
First, construct an ensemble of dynamics models.
Then, to make the policy aware of each possible transition dynamics and adaptable to it, apply a meta-learning technique: an extra environment-context extractor represents the dynamics pattern, and the policy adjusts itself according to the extracted environment context.

Theory

  • Prove a bound on the performance gap between the proposed adaptable policy $\pi_{a}$ and a common constraint-based policy $\pi_{c}$.
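
A minimal way to see the direction of this gap (my own schematic argument, not the paper's exact theorem): the class of context-conditioned policies $\Pi_{a}$ contains the context-free constrained policies $\Pi_{c}$, so its best member is at least as good under every dynamics $T$ in the model set $\mathcal{T}$:

$$\max_{\pi \in \Pi_{a}} \mathbb{E}_{\tau \sim (\pi, T)}\Big[\sum_{t} \gamma^{t} r(s_{t}, a_{t})\Big] \;\ge\; \max_{\pi \in \Pi_{c}} \mathbb{E}_{\tau \sim (\pi, T)}\Big[\sum_{t} \gamma^{t} r(s_{t}, a_{t})\Big], \qquad \forall \, T \in \mathcal{T}.$$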

Algo

  • Context-aware adaptable policy $\pi_{\phi}$: the context-aware policy $\pi \left( a \mid s, z \right)$ takes actions based on the current state $s$ and the environment-context vector $z=\phi \left( \hat{\rho} \right)$ extracted from the interaction history $\hat{\rho}$ of the given environment. $\phi$ and $\pi_{\phi}$ are optimized jointly by maximizing the expected (penalized) return of the context-aware policy over rollouts from the dynamics-model set; a minimal sketch of both modules appears after this list.

  • Environment-context extractor $\phi$: an RNN embeds the sequential interaction information into environment-context vectors, $z_{t}=\phi \left( s_{t}, a_{t-1}, z_{t-1} \right)$. The extractor is trained within the probe-reduce loop: the context-aware policy automatically probes the environment and then reduces the candidate policy set based on what it observes.


  • Dynamics-model set: use an ensemble, which tends to predict similar transitions in the accessible (in-support) space and divergent transitions in the inaccessible space. Several tricks are added to mitigate compounding error (a rollout sketch follows this list):

    • Constrain the maximum rollout length.
    • Penalize reward based on uncertainty (same as MOPO but with relaxed coefficients)
    • Terminate trajectories when the predicted next states are out of range of $\left( s_{min}, s_{max} \right)$.
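
A minimal sketch of the environment-context extractor and the context-aware policy, in PyTorch-style Python (my own illustration; the module names, hidden sizes, and the Gaussian policy head are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class ContextExtractor(nn.Module):
    """GRU that embeds the recent (s_t, a_{t-1}) history into a context vector z_t."""
    def __init__(self, state_dim, action_dim, context_dim=16):
        super().__init__()
        self.rnn = nn.GRU(state_dim + action_dim, context_dim, batch_first=True)

    def forward(self, states, prev_actions, z_prev=None):
        # states: (B, T, state_dim), prev_actions: (B, T, action_dim)
        x = torch.cat([states, prev_actions], dim=-1)
        z_seq, z_last = self.rnn(x, z_prev)   # z_t = phi(s_t, a_{t-1}, z_{t-1})
        return z_seq, z_last

class ContextAwarePolicy(nn.Module):
    """Gaussian policy pi(a | s, z) conditioned on the state and the environment context."""
    def __init__(self, state_dim, action_dim, context_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state, z):
        h = self.net(torch.cat([state, z], dim=-1))
        return self.mu(h), self.log_std(h).clamp(-20, 2)
```

Both modules are trained jointly with the same actor-critic (SAC-style) loss, with $z$ simply treated as an extra input to the policy and the critics.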

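A rough sketch of one branched-rollout step with the model set, showing the uncertainty-based reward penalty and the out-of-range termination (my own paraphrase; `models`, `penalty_coef`, and the disagreement measure are assumptions, not the paper's implementation):

```python
import numpy as np

def rollout_step(models, state, action, s_min, s_max, penalty_coef=0.25):
    """One model-based rollout step with an ensemble of learned dynamics models.

    Each model maps (state, action) -> (next_state, reward).
    Returns (next_state, penalized_reward, done).
    """
    preds = [m.predict(state, action) for m in models]   # ensemble predictions
    next_states = np.stack([p[0] for p in preds])
    rewards = np.array([p[1] for p in preds])

    # Use one randomly chosen ensemble member for the actual transition.
    k = np.random.randint(len(models))
    next_state, reward = next_states[k], rewards[k]

    # Uncertainty penalty: disagreement among ensemble members
    # (MOPO-style penalty, here measured as the max std over state dimensions).
    uncertainty = next_states.std(axis=0).max()
    penalized_reward = reward - penalty_coef * uncertainty

    # Early termination when the prediction leaves the valid state range.
    done = bool(np.any(next_state < s_min) or np.any(next_state > s_max))
    return next_state, penalized_reward, done
```

The maximum rollout length is enforced outside this function by truncating each branched rollout after a fixed number of steps.
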
Env & Baselines

D4RL (MuJoCo)
MOPO, SAC, BEAR, BRAC-v, CQL, BC

Exp

  1. Comparative Evaluation on Benchmark Tasks (baselines) $\to$ MAPLE outperforms the other SOTA methods on 7 of the 12 tasks.
  2. Analysis of MAPLE $\to$ Analyze the relationship among the degree of constraint (maximum rollout horizon), the size of the model ensemble, and the asymptotic performance.
  3. MAPLE with a large dynamics-model set $\to$ Enlarging the model set significantly helps find a better and more robust adaptable policy by expanding the exploration boundary.
