[Paper Notes] MAPLE
Offline Model-based Adaptable Policy Learning (NeurIPS 2021)
Problem
Offline RL methods that constrain the policy to the in-support regions of the offline dataset limit the potential of the resulting policies.
Idea
Model OOD regions by constructing all possible transition patterns and give a probe-reduce paradigm for decision-making in OOD regions.
First, construct ensemble dynamics models.
Then, to make the policy aware of each possible transition pattern and adaptable to it, apply a meta-learning technique: an extra environment-context extractor represents the dynamics pattern, and the policy adjusts itself according to the extracted environment context.
Theory
- Prove a bound on the performance gap between the proposed adaptable policy $\pi_{a}$ and a common constraint-based policy $\pi_{c}$.
Algo
Context-aware adaptable policy $\pi_{\phi}$ : The context-aware policy $\pi \left( a \mid s, z \right)$ takes actions based on the current state $s$ and the environment-context vector $z=\phi \left( \hat{\rho} \right)$ of the given environment. $\phi$ and $\pi_{\phi}$ are optimized jointly with an SAC-style actor-critic objective over the augmented input $\left( s, z \right)$.
Environment-context extractor $\phi$ : An RNN embeds the sequential transition information into environment-context vectors $z_{t}=\phi \left( s_{t}, a_{t-1}, z_{t-1} \right)$. The extractor is trained via a probe-reduce loop: the context-aware policy automatically probes the environment, then reduces the set of candidate behaviors accordingly.
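The extractor-plus-policy structure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all dimensions, parameter names, and the vanilla-tanh RNN cell are assumptions standing in for the paper's actual architecture.

```python
import numpy as np

STATE_DIM, ACTION_DIM, CONTEXT_DIM = 4, 2, 8
rng = np.random.default_rng(0)

# RNN cell parameters for the extractor phi (vanilla tanh recurrence).
W_in = rng.normal(0, 0.1, (CONTEXT_DIM, STATE_DIM + ACTION_DIM))
W_rec = rng.normal(0, 0.1, (CONTEXT_DIM, CONTEXT_DIM))

# Policy head: a linear layer over the context-augmented input (s, z).
W_pi = rng.normal(0, 0.1, (ACTION_DIM, STATE_DIM + CONTEXT_DIM))

def extract_context(s, a_prev, z_prev):
    """z_t = phi(s_t, a_{t-1}, z_{t-1}): fold the transition history into z."""
    x = np.concatenate([s, a_prev])
    return np.tanh(W_in @ x + W_rec @ z_prev)

def policy(s, z):
    """pi(a | s, z): deterministic mean action, for illustration only."""
    return np.tanh(W_pi @ np.concatenate([s, z]))

# Probe phase: roll the extractor along a trajectory so z accumulates
# evidence about which dynamics pattern the agent is currently facing.
z = np.zeros(CONTEXT_DIM)
a = np.zeros(ACTION_DIM)
for _ in range(5):
    s = rng.normal(size=STATE_DIM)  # stand-in for an observed state
    z = extract_context(s, a, z)    # update the environment context
    a = policy(s, z)                # act conditioned on (s, z)

print(a.shape, z.shape)  # (2,) (8,)
```

The key design point is that $z$ is recurrent: early "probe" actions shape later contexts, so the same policy network can behave differently under different dynamics models.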
Dynamics model set: Use an ensemble, which predicts similar transitions in the accessible (in-support) space and divergent transitions in the inaccessible space. Several tricks mitigate compounding error:
- Constrain the maximum rollout length.
- Penalize the reward based on uncertainty (as in MOPO, but with relaxed coefficients).
- Terminate trajectories when the predicted next state falls outside the range $\left( s_{min}, s_{max} \right)$.
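The three rollout tricks above can be combined in one loop, sketched below. Everything here (the toy linear ensemble, the reward function, the coefficient value) is an illustrative assumption; only the structure — horizon cap, disagreement-based penalty, out-of-range termination — follows the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, N_MODELS = 3, 5
MAX_HORIZON = 10          # trick 1: constrain maximum rollout length
PENALTY_COEF = 0.5        # trick 2: uncertainty penalty (value is illustrative)
S_MIN, S_MAX = -5.0, 5.0  # trick 3: valid state range from the offline data

# Toy ensemble: each member is a random linear dynamics model s' = A s.
ensemble = [rng.normal(0, 0.3, (STATE_DIM, STATE_DIM)) for _ in range(N_MODELS)]

def rollout(s0):
    s, transitions = s0, []
    for _ in range(MAX_HORIZON):
        preds = np.stack([A @ s for A in ensemble])  # all members' predictions
        s_next = preds[rng.integers(N_MODELS)]       # sample one member
        # Penalize reward by ensemble disagreement (max std over state dims),
        # which is large exactly in the inaccessible / OOD space.
        uncertainty = preds.std(axis=0).max()
        reward = -np.linalg.norm(s_next) - PENALTY_COEF * uncertainty
        transitions.append((s, s_next, reward))
        if np.any(s_next < S_MIN) or np.any(s_next > S_MAX):
            break  # predicted state left the trusted range: stop the rollout
        s = s_next
    return transitions

traj = rollout(rng.normal(size=STATE_DIM))
print(len(traj) <= MAX_HORIZON)  # True
```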
Env & Baselines
D4RL(Mujoco)
MOPO, SAC, BEAR, BRAC-v, CQL, BC
Exp
- Comparative evaluation on benchmark tasks (vs. baselines) $\to$ MAPLE outperforms other SOTA methods on 7 of 12 tasks.
- Analysis of MAPLE $\to$ Analyze the relationship among constraint degree (maximum rollout horizon), the size of ensemble models, and the asymptotic performance.
- MAPLE with a large dynamics model set $\to$ Increasing the size of the model set significantly helps find a better and more robust adaptable policy by expanding the exploration boundary.