[Paper Notes] MOPP
Model-Based Offline Planning with Trajectory Pruning (IJCAI 2022)
Problem
- A lightweight policy improvement procedure is needed in real-world robotic and industrial settings.
- Industrial control tasks often require extra control flexibility (e.g., changed reward signals, or added constraints due to safety considerations).
- MOPP tries to tackle the dilemma between the restrictions of offline learning and high-performance planning.
Idea
MOPP encourages more aggressive trajectory rollouts guided by the behavior policy learned from data to improve exploration, and prunes out problematic trajectories to avoid potential out-of-distribution samples.
Algo
- Dynamics and Behavior Policy Learning: Use autoregressive dynamics models (ADMs) to learn the probabilistic dynamics model and the behavior policy, both trained by maximizing log-likelihood on the dataset. Ensembles of $K$ ADMs with randomly permuted dimension orderings are used for both the dynamics and the behavior policy (see the ADM sketch after this list).
- Value Function Evaluation: Learn the Q-value function $Q_{b}$ by minimizing the TD error on the offline dataset; then calculate the value function $V_{b}$ from $Q_{b}$ by sampling actions from the behavior policy.
- Offline Planning (see the planning-loop sketch after this list):
    - Guided Trajectory Rollout: Use the learned behavior policy, with its standard deviation scaled up to allow a higher degree of freedom, to sample rollouts in the learned dynamics model; then perform a max-Q operation on the sampled candidate actions based on $Q_{b}$.
    - Trajectory Pruning: As in MOReL, prune rollouts whose accumulated dynamics-ensemble disagreement (uncertainty) exceeds a threshold $L$.
    - Trajectory Optimization: Obtain the optimized action by re-weighting the actions of each trajectory according to their exponentiated returns.
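A minimal sketch of a single ADM ensemble member, assuming a per-dimension Gaussian factorization over a randomly permuted ordering; the class and parameter names are illustrative rather than taken from the paper's code (for the behavior policy, the same structure would predict $a$ from $s$ alone):

```python
# Illustrative ADM member: each target dimension is predicted
# conditioned on (s, a) and the previously generated dimensions.
import torch
import torch.nn as nn

class ADM(nn.Module):
    def __init__(self, obs_dim, act_dim, target_dim, hidden=200):
        super().__init__()
        # Each ensemble member fixes its own random dimension ordering.
        self.order = torch.randperm(target_dim)
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim + act_dim + i, hidden),
                          nn.ReLU(),
                          nn.Linear(hidden, 2))  # per-dim Gaussian (mean, log_std)
            for i in range(target_dim)
        ])

    def log_prob(self, s, a, target):
        """Log-likelihood of a target vector (e.g., next-state delta and reward);
        training maximizes this over the offline dataset."""
        x, logp = torch.cat([s, a], -1), 0.0
        t = target[..., self.order]  # apply this member's ordering
        for i, head in enumerate(self.heads):
            inp = torch.cat([x, t[..., :i]], -1)  # condition on earlier dims
            mean, log_std = head(inp).chunk(2, -1)
            dist = torch.distributions.Normal(mean.squeeze(-1),
                                              log_std.exp().squeeze(-1))
            logp = logp + dist.log_prob(t[..., i])
        return logp
```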
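And a sketch of the planning procedure combining guided rollout, pruning, and exponentiated-return re-weighting. Only $\sigma_{M}$ and $L$ correspond to hyperparameters named in these notes; the helper callables (`behavior`, `dynamics`, `Q_b`, `V_b`, `uncertainty`) and the remaining parameters are assumptions for illustration:

```python
# Illustrative MOPP planning step; assumes at least one rollout survives pruning.
import numpy as np

def mopp_plan(s0, behavior, dynamics, Q_b, V_b, uncertainty,
              N=100, H=10, n_cand=5, sigma_M=2.0, L=1.0, kappa=1.0):
    trajs, returns = [], []
    for _ in range(N):                            # N guided rollouts from s0
        s, R, unc, actions = s0, 0.0, 0.0, []
        for _ in range(H):
            # Guided rollout: sample candidates from the behavior policy with
            # std scaled by sigma_M (more aggressive than plain behavior
            # cloning), then keep the max-Q candidate.
            cands = [behavior(s, std_scale=sigma_M) for _ in range(n_cand)]
            a = max(cands, key=lambda c: Q_b(s, c))
            unc += uncertainty(s, a)              # dynamics-ensemble disagreement
            s, r = dynamics(s, a)
            R += r
            actions.append(a)
        R += V_b(s)                               # terminal value extends the horizon
        if unc <= L:                              # pruning: drop uncertain rollouts
            trajs.append(np.stack(actions))
            returns.append(R)
    # Trajectory optimization: weight trajectories by exponentiated returns
    # (max-subtracted for numerical stability) and average their actions.
    w = np.exp(kappa * (np.asarray(returns) - max(returns)))
    plan = (w[:, None, None] * np.stack(trajs)).sum(0) / w.sum()
    return plan[0]                                # execute first action, MPC-style
```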
Env & Baselines
D4RL (MuJoCo, Adroit)
MBPO, MOPO, MBOP, BCQ, CQL
Exp
- Comparative Evaluations [baselines] $\to$
    - Performance on MuJoCo: MOPP outperforms MBOP on most tasks, showing large planning improvements on the random and mixed-policy datasets. The model-based MOPO and MBPO benefit from high-diversity datasets, but MOPP can effectively recover the performant data-generating policies underlying the behavioral data and use planning to further enhance performance.
- Performance on Adroit: Although model-based offline RL methods are known to perform badly on low-diversity datasets, MOPP performs surprisingly well in most cases.
- Ablation Study $\to$ Sampling aggressiveness (std scaling parameter $\sigma_{M}$), value function $V_{b}$, uncertainty threshold $L$ in trajectory pruning.
- Evaluation on Control Flexibility (halfcheetah-jump and halfcheetah-constrained) $\to$ Only a sub-component of MOPP needs to be re-evaluated to guarantee the best performance, rather than re-training the whole model as in typical RL settings. Adding a constraint penalty to the reward function while re-evaluating $Q_{b}$ via FQE leads to the safest policy; a minimal sketch follows.
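A minimal sketch of that re-evaluation step, assuming hypothetical dataset and network helpers; only the penalized-reward FQE target reflects the idea above:

```python
# Illustrative FQE re-evaluation of Q_b under a penalized reward; the
# dynamics and behavior models are left untouched. All helper objects
# (dataset, q_net, q_target, penalty) are hypothetical.
def fqe_with_penalty(dataset, behavior, q_net, q_target, penalty,
                     lam=10.0, gamma=0.99, n_iters=100_000):
    for step in range(n_iters):
        s, a, r, s_next = dataset.sample_batch()       # offline transitions
        r_pen = r - lam * penalty(s, a)                # e.g. safety-constraint cost
        a_next = behavior(s_next)                      # evaluate the fixed behavior policy
        target = r_pen + gamma * q_target(s_next, a_next)
        q_net.update(s, a, target)                     # regress Q_b(s, a) toward the TD target
        if step % 1000 == 0:
            q_target.sync(q_net)                       # periodically refresh the target network
    return q_net
```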