[Paper Notes] MOPP
Model-Based Offline Planning with Trajectory Pruning (IJCAI 2022)
Problem
- A lightweight policy improvement procedure is needed in real-world robotic and industrial settings.
- Industrial control tasks often require extra control flexibility (e.g., changed reward signals, or added constraints due to safety considerations).
- MOPP tries to tackle the dilemma between the restrictions of offline learning and high-performance planning.
Idea
MOPP encourages more aggressive trajectory rollouts guided by the behavior policy learned from data to improve exploration, and prunes out problematic trajectories to avoid potential out-of-distribution samples.
Algo
- Dynamics and Behavior Policy Learning: Use autoregressive dynamics models (ADMs) to learn the probabilistic dynamics model and the behavior policy, both trained by maximizing log-likelihood on the dataset. Ensembles of $K$ ADMs with randomly permuted dimension orderings are used for both the dynamics and the behavior policy (see the ADM sketch after this list).
- Value Function Evaluation: Learn the Q-value function $Q_{b}$ by minimizing the TD error on the offline dataset; then calculate the value function $V_{b}$ from $Q_{b}$ by sampling actions from the behavior policy.
- Offline Planning (see the planning-loop sketch after this list):
    - Guided Trajectory Rollout: Use the learned behavior policy, with its standard deviation scaled up to allow a higher degree of freedom, to sample rollouts in the learned dynamics model; then perform a max-Q operation on the sampled candidate actions based on $Q_{b}$.
    - Trajectory Pruning: As in MOReL, prune rollouts whose accumulated dynamics-ensemble disagreement (uncertainty) exceeds a threshold $L$.
    - Trajectory Optimization: Obtain the optimized action by re-weighting the actions of each trajectory according to their exponentiated returns.
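A minimal sketch of a single ADM ensemble member, assuming a per-dimension Gaussian factorization over a randomly permuted ordering; the class and parameter names are illustrative rather than taken from the paper's code (for the behavior policy, the same structure would predict $a$ from $s$ alone):

```python
# Illustrative ADM member: each target dimension is predicted
# conditioned on (s, a) and the previously generated dimensions.
import torch
import torch.nn as nn

class ADM(nn.Module):
    def __init__(self, obs_dim, act_dim, target_dim, hidden=200):
        super().__init__()
        # Each ensemble member fixes its own random dimension ordering.
        self.order = torch.randperm(target_dim)
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim + act_dim + i, hidden),
                          nn.ReLU(),
                          nn.Linear(hidden, 2))  # per-dim Gaussian (mean, log_std)
            for i in range(target_dim)
        ])

    def log_prob(self, s, a, target):
        """Log-likelihood of a target vector (e.g., next-state delta and reward);
        training maximizes this over the offline dataset."""
        x, logp = torch.cat([s, a], -1), 0.0
        t = target[..., self.order]  # apply this member's ordering
        for i, head in enumerate(self.heads):
            inp = torch.cat([x, t[..., :i]], -1)  # condition on earlier dims
            mean, log_std = head(inp).chunk(2, -1)
            dist = torch.distributions.Normal(mean.squeeze(-1),
                                              log_std.exp().squeeze(-1))
            logp = logp + dist.log_prob(t[..., i])
        return logp
```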
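And a sketch of the planning procedure combining guided rollout, pruning, and exponentiated-return re-weighting. Only $\sigma_{M}$ and $L$ correspond to hyperparameters named in these notes; the helper callables (`behavior`, `dynamics`, `Q_b`, `V_b`, `uncertainty`) and the remaining parameters are assumptions for illustration:

```python
# Illustrative MOPP planning step; assumes at least one rollout survives pruning.
import numpy as np

def mopp_plan(s0, behavior, dynamics, Q_b, V_b, uncertainty,
              N=100, H=10, n_cand=5, sigma_M=2.0, L=1.0, kappa=1.0):
    trajs, returns = [], []
    for _ in range(N):                            # N guided rollouts from s0
        s, R, unc, actions = s0, 0.0, 0.0, []
        for _ in range(H):
            # Guided rollout: sample candidates from the behavior policy with
            # std scaled by sigma_M (more aggressive than plain behavior
            # cloning), then keep the max-Q candidate.
            cands = [behavior(s, std_scale=sigma_M) for _ in range(n_cand)]
            a = max(cands, key=lambda c: Q_b(s, c))
            unc += uncertainty(s, a)              # dynamics-ensemble disagreement
            s, r = dynamics(s, a)
            R += r
            actions.append(a)
        R += V_b(s)                               # terminal value extends the horizon
        if unc <= L:                              # pruning: drop uncertain rollouts
            trajs.append(np.stack(actions))
            returns.append(R)
    # Trajectory optimization: weight trajectories by exponentiated returns
    # (max-subtracted for numerical stability) and average their actions.
    w = np.exp(kappa * (np.asarray(returns) - max(returns)))
    plan = (w[:, None, None] * np.stack(trajs)).sum(0) / w.sum()
    return plan[0]                                # execute first action, MPC-style
```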
Env & Baselines
D4RL (MuJoCo, Adroit)
MBPO, MOPO, MBOP, BCQ, CQL
Exp
- Comparative Evaluations [baselines] $\to$
    - Performance on MuJoCo: MOPP outperforms MBOP on most tasks, showing large planning improvements on the random and mixed-policy datasets. The model-based MOPO and MBPO benefit from high-diversity datasets, but MOPP can effectively recover the performant data-generating policies underlying the behavioral data and use planning to further enhance performance.
- Performance on Adroit: Although model-based offline RL methods are known to perform badly on low-diversity datasets, MOPP performs surprisingly well in most cases.
- Ablation Study $\to$ Sampling aggressiveness (std scaling parameter $\sigma_{M}$), value function $V_{b}$, uncertainty threshold $L$ in trajectory pruning.
- Evaluation on Control Flexibility (halfcheetah-jump and halfcheetah-constrained) $\to$ Only a sub-component of MOPP needs to be re-evaluated to guarantee the best performance, rather than re-training the whole model as in typical RL settings. Adding a constraint penalty to the reward function while re-evaluating $Q_{b}$ via FQE leads to the safest policy; a minimal sketch follows.
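A minimal sketch of that re-evaluation step, assuming hypothetical dataset and network helpers; only the penalized-reward FQE target reflects the idea above:

```python
# Illustrative FQE re-evaluation of Q_b under a penalized reward; the
# dynamics and behavior models are left untouched. All helper objects
# (dataset, q_net, q_target, penalty) are hypothetical.
def fqe_with_penalty(dataset, behavior, q_net, q_target, penalty,
                     lam=10.0, gamma=0.99, n_iters=100_000):
    for step in range(n_iters):
        s, a, r, s_next = dataset.sample_batch()       # offline transitions
        r_pen = r - lam * penalty(s, a)                # e.g. safety-constraint cost
        a_next = behavior(s_next)                      # evaluate the fixed behavior policy
        target = r_pen + gamma * q_target(s_next, a_next)
        q_net.update(s, a, target)                     # regress Q_b(s, a) toward the TD target
        if step % 1000 == 0:
            q_target.sync(q_net)                       # periodically refresh the target network
    return q_net
```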