[Paper Notes] MOPP

Model-Based Offline Planning with Trajectory Pruning   IJCAI 2022

Problem

  • A lightweight policy improvement procedure is required in real-world robotic and industrial settings.
  • Industrial control tasks often require extra control flexibility (e.g., changing reward signals or adding constraints due to safety considerations).
  • The paper tries to tackle the dilemma between the restrictions of offline learning and high-performance planning.

Idea

MOPP encourages more aggressive trajectory rollouts guided by the behavior policy learned from data to improve exploration, and prunes out problematic trajectories to avoid potential out-of-distribution samples.

Algo

  1. Dynamics and Behavior Policy Learning: Use an autoregressive dynamics model (ADM) to learn both the probabilistic dynamics model and the behavior policy; both are trained by maximizing log-likelihood on the dataset (see the first sketch after this list).

    Use an ensemble of $K$ ADMs with randomly permuted dimension orderings for both the dynamics model and the behavior policy.
  2. Value Function Evaluation: Learn the Q-value function $Q_{b}$ by minimizing the TD error on the offline dataset.

    Then estimate the value function $V_{b}$ by sampling actions from the learned behavior policy and aggregating their Q-values (see the second sketch after this list).

  3. Offline Planning (see the third sketch after this list)
    1. Guided Trajectory Rollout: Use the learned behavior policy to sample rollouts in the learned dynamics model with a higher degree of freedom (an enlarged sampling standard deviation, controlled by $\sigma_{M}$). Then perform a max-Q operation over the sampled candidate actions based on $Q_{b}$.

    2. Trajectory Pruning: Prune rollouts whose uncertainty, measured by the disagreement among the ensemble dynamics models, exceeds a threshold $L$, in the same spirit as MOReL.

    3. Trajectory Optimization: The output action is obtained by re-weighting the actions of the retained trajectories according to their exponentiated returns.
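
Below is a minimal PyTorch sketch of the behavior-policy half of an ADM, under my own assumptions about the architecture (one Gaussian head per action dimension with a fixed ordering, no ensembling). The dynamics model and the $K$-member ensemble with permuted orderings would follow the same autoregressive pattern; class and function names are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

class AutoregressiveBehaviorPolicy(nn.Module):
    """Models p(a | s) = prod_i p(a_i | s, a_<i), each factor a Gaussian."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        # One small head per action dimension, conditioned on the state and
        # all previously generated action dimensions (hypothetical layout).
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + i, hidden), nn.ReLU(),
                nn.Linear(hidden, 2),  # outputs (mean, log_std) of a_i
            )
            for i in range(action_dim)
        ])

    def log_prob(self, state, action):
        logp = 0.0
        for i, head in enumerate(self.heads):
            inp = torch.cat([state, action[:, :i]], dim=-1)
            mean, log_std = head(inp).chunk(2, dim=-1)
            dist = torch.distributions.Normal(mean, log_std.exp())
            logp = logp + dist.log_prob(action[:, i:i + 1]).sum(-1)
        return logp

    @torch.no_grad()
    def sample(self, state, std_scale=1.0):
        # std_scale > 1 mimics the more aggressive sampling (sigma_M)
        # used by MOPP's guided rollouts.
        dims = []
        for head in self.heads:
            inp = torch.cat([state] + dims, dim=-1)
            mean, log_std = head(inp).chunk(2, dim=-1)
            dims.append(torch.normal(mean, std_scale * log_std.exp()))
        return torch.cat(dims, dim=-1)

# Training is plain maximum likelihood on the offline dataset:
#   loss = -policy.log_prob(state_batch, action_batch).mean()
```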
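
For step 2, here is a sketch of a TD-style evaluation of $Q_{b}$ on the offline data and a sampled estimate of $V_{b}$. The network interfaces and the choice of aggregation (max vs. mean over sampled Q-values) are my assumptions, not details confirmed by the paper.

```python
import torch

def td_loss(q_net, target_q_net, behavior_policy, batch, gamma=0.99):
    """One TD update on an offline batch; next actions come from the behavior policy."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = behavior_policy.sample(s_next)
        target = r + gamma * (1 - done) * target_q_net(s_next, a_next).squeeze(-1)
    return ((q_net(s, a).squeeze(-1) - target) ** 2).mean()

@torch.no_grad()
def v_b(q_net, behavior_policy, s, n_samples=10):
    """Estimate V_b(s) by sampling actions from the behavior policy and aggregating Q_b."""
    qs = torch.stack([q_net(s, behavior_policy.sample(s)).squeeze(-1)
                      for _ in range(n_samples)])
    return qs.max(dim=0).values  # or qs.mean(dim=0), depending on the paper's choice
```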
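
Finally, a sketch of the planning loop itself: guided rollouts with a max-Q re-selection over candidate actions, pruning by ensemble disagreement against the threshold $L$, and exponentiated-return re-weighting of the surviving first actions. The averaging across ensemble members, the exact uncertainty measure, and all names/parameters here are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def plan(state, dynamics_ensemble, behavior_policy, q_fn, v_fn,
         horizon=10, n_traj=100, n_candidates=8,
         uncertainty_threshold=1.0, kappa=1.0):
    trajectories = []
    for _ in range(n_traj):
        s, actions, ret, unc = state.copy(), [], 0.0, 0.0
        for _ in range(horizon):
            # Guided rollout: sample candidate actions from the (std-scaled)
            # behavior policy and keep the one with the highest Q_b value.
            candidates = [behavior_policy.sample(s) for _ in range(n_candidates)]
            a = max(candidates, key=lambda c: q_fn(s, c))
            # Step every ensemble member; their disagreement is the uncertainty.
            preds = [m.predict(s, a) for m in dynamics_ensemble]  # (s', r) pairs
            next_states = np.stack([p[0] for p in preds])
            rewards = np.array([p[1] for p in preds])
            unc = max(unc, next_states.std(axis=0).max())
            s = next_states.mean(axis=0)
            ret += rewards.mean()          # undiscounted over the short horizon
            actions.append(a)
        ret += v_fn(s)                     # terminal value from V_b
        trajectories.append((actions, ret, unc))

    # Trajectory pruning: discard rollouts whose uncertainty exceeds L.
    kept = [(a, r) for a, r, u in trajectories if u <= uncertainty_threshold]
    if not kept:
        kept = [(a, r) for a, r, _ in trajectories]  # fall back if all are pruned

    # Trajectory optimization: exponentiated-return weighting of first actions.
    returns = np.array([r for _, r in kept])
    weights = np.exp(kappa * (returns - returns.max()))
    weights /= weights.sum()
    first_actions = np.stack([a[0] for a, _ in kept])
    return (weights[:, None] * first_actions).sum(axis=0)
```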

Env & Baselines

D4RL (MuJoCo, Adroit)
MBPO, MOPO, MBOP, BCQ, CQL

Exp

  • Comparative Evaluations [baselines] $\to$
    • Performance on MuJoCo: MOPP outperforms MBOP on most tasks, showing a strong planning improvement over the random and mixed data-collecting policies. Model-based MOPO and MBPO can benefit from high-diversity datasets, but MOPP can effectively recover the performant data-generating policies within the behavioral data and use planning to further enhance performance.
    • Performance on Adroit: Although model-based offline RL methods are known to perform poorly on low-diversity datasets, MOPP performs surprisingly well in most cases.
  • Ablation Study $\to$ Sampling aggressiveness (std scaling parameter $\sigma_{M}$), the value function $V_{b}$, and the uncertainty threshold $L$ in trajectory pruning.
  • Evaluation on Control Flexibility (halfcheetah-jump and halfcheetah-constrained) $\to$ Only re-evaluating a sub-component of MOPP is required to obtain the best performance, rather than re-training the whole model as in typical RL settings. Adding a constraint penalty to the reward function while re-evaluating $Q_{b}$ via FQE leads to the safest policy.
