[Paper Notes] SAC

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (ICML 2018)

Idea

Model-free methods typically suffer from very high sample complexity and brittle convergence properties. SAC is an off-policy, maximum-entropy, stochastic actor-critic algorithm that encourages exploration and achieves both sample efficiency and learning stability.
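
For reference, the maximum entropy objective augments the expected return with the policy's entropy at each visited state, weighted by a temperature $\alpha$:

$$
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]
$$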

Theory

Presents a convergence proof for soft policy iteration (alternating soft policy evaluation and soft policy improvement) in the maximum entropy framework.
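
Concretely, soft policy evaluation repeatedly applies the soft Bellman backup operator $\mathcal{T}^{\pi}$, and soft policy improvement projects the exponentiated soft Q-function back onto the policy class $\Pi$ via a KL projection; each step is shown not to decrease the soft Q-value, and alternating them converges to the optimal maximum entropy policy within $\Pi$:

$$
\mathcal{T}^{\pi} Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \big[ V(s_{t+1}) \big], \qquad V(s_t) = \mathbb{E}_{a_t \sim \pi} \big[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \big]
$$

$$
\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} \mathrm{D}_{\mathrm{KL}} \left( \pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{\text{old}}}(s_t, \cdot)\big)}{Z^{\pi_{\text{old}}}(s_t)} \right)
$$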

Algo

The actor-critic framework consists of current and target soft value networks $V_{\psi}\left(s_{t}\right)$ and $V_{\bar{\psi}}\left(s_{t}\right)$, double soft Q-value networks $Q_{\theta_{1,2}}\left(s_{t},a_{t}\right)$, and an actor (policy) network $\pi_{\phi}\left(a_{t}\mid s_{t}\right)$.
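
A minimal PyTorch sketch of how these networks interact in one update step. The MLP architecture, layer sizes, and batch format are illustrative assumptions rather than the authors' reference implementation, and optimizer steps are omitted:

```python
# Minimal sketch of the SAC (v1) losses; sizes and MLPs are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

OBS_DIM, ACT_DIM, HIDDEN, GAMMA = 8, 2, 256, 0.99  # illustrative sizes

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, out_dim))

value, value_target = mlp(OBS_DIM, 1), mlp(OBS_DIM, 1)          # V_psi, V_psi_bar
value_target.load_state_dict(value.state_dict())
q1, q2 = mlp(OBS_DIM + ACT_DIM, 1), mlp(OBS_DIM + ACT_DIM, 1)   # Q_theta_1, Q_theta_2
policy = mlp(OBS_DIM, 2 * ACT_DIM)                              # pi_phi -> (mean, log_std)

def sample_action(obs):
    """Reparameterized tanh-Gaussian action and its log-probability."""
    mean, log_std = policy(obs).chunk(2, dim=-1)
    dist = Normal(mean, log_std.clamp(-20, 2).exp())
    pre_tanh = dist.rsample()                   # reparameterization trick
    action = torch.tanh(pre_tanh)
    log_prob = (dist.log_prob(pre_tanh)
                - torch.log(1 - action.pow(2) + 1e-6)).sum(-1, keepdim=True)
    return action, log_prob

def losses(obs, act, rew, next_obs, done):
    """SAC losses for one replay batch: obs [B, OBS_DIM], act [B, ACT_DIM], rew/done [B, 1]."""
    new_act, log_pi = sample_action(obs)
    q_new = torch.min(q1(torch.cat([obs, new_act], -1)),
                      q2(torch.cat([obs, new_act], -1)))
    # J_V: regress V_psi(s) toward E[ min(Q_1, Q_2) - log pi ]
    value_loss = F.mse_loss(value(obs), (q_new - log_pi).detach())
    # J_Q: soft Bellman residual, using the *target* value network on s'
    q_target = rew + GAMMA * (1 - done) * value_target(next_obs).detach()
    sa = torch.cat([obs, act], -1)
    q_loss = F.mse_loss(q1(sa), q_target) + F.mse_loss(q2(sa), q_target)
    # J_pi: minimize E[ log pi - min Q ] through the reparameterized sample
    policy_loss = (log_pi - q_new).mean()
    return value_loss, q_loss, policy_loss
```

In practice each network gets its own optimizer, so the three losses above are backpropagated and stepped separately.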

Env & Baselines

Gym, Humanoid (rllab) [continuous control tasks]
DDPG, PPO, SQL, TD3, Trust-PCL

Exp

  1. Comparative Evaluation with DDPG, PPO, SQL, TD3 (Gym, Humanoid rllab)
  2. Ablation Study
    • Stochastic vs deterministic policy (remove entropy) [Humanoid rllab] $\to$ Stochasticity can stabilize training.
    • Policy evaluation [Gym Ant-v1] $\to$ Evaluating with the deterministic mean action yields better performance than sampling from the stochastic policy.
    • Reward scale hyperparameter [Gym Ant-v1] $\to$ With the right reward scaling, the model balances exploration and exploitation, leading to faster learning and better asymptotic performance.
    • Target network update [Gym Ant-v1] $\to$ Large $\tau$ can lead to instabilities, while small $\tau$ makes training slower (both this update and deterministic evaluation are sketched after this list).
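
A short sketch of the two details probed above, in the same assumed PyTorch setting (the network objects are passed in and assumed to follow the shapes used in the Algo sketch): deterministic evaluation acts with the tanh-squashed mean, and the target value network tracks the current one through Polyak averaging with coefficient $\tau$:

```python
import torch

TAU = 0.005  # illustrative value; the ablation sweeps this coefficient

@torch.no_grad()
def evaluate_action(policy_net, obs):
    """Deterministic evaluation: act with the tanh-squashed mean instead of sampling."""
    mean, _ = policy_net(obs).chunk(2, dim=-1)
    return torch.tanh(mean)

@torch.no_grad()
def soft_update(value_net, value_target_net, tau=TAU):
    """Polyak averaging: psi_bar <- tau * psi + (1 - tau) * psi_bar."""
    for p, p_targ in zip(value_net.parameters(), value_target_net.parameters()):
        p_targ.mul_(1 - tau).add_(p, alpha=tau)
```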
