[Paper Notes] SAC

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (ICML 2018)

Idea

Model-free methods typically suffer from very high sample complexity and brittle convergence properties. SAC is an off-policy, maximum-entropy, stochastic actor-critic algorithm that encourages exploration while providing both sample efficiency and learning stability.
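
For reference, the maximum entropy objective that SAC optimizes augments the expected return with the policy entropy. Here $\alpha$ is the temperature weighting the entropy term and $\rho_{\pi}$ is the state-action distribution induced by $\pi$:

$$
J(\pi)=\sum_{t=0}^{T} \mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \rho_{\pi}}\left[r\left(s_{t}, a_{t}\right)+\alpha \mathcal{H}\left(\pi\left(\cdot \mid s_{t}\right)\right)\right]
$$

Setting $\alpha \to 0$ recovers the standard expected-return objective.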

Theory

The paper presents a convergence proof for soft policy iteration (alternating policy evaluation and policy improvement) in the maximum entropy framework.
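
The two alternating steps can be written out as follows, with the temperature absorbed into the reward scale as in the paper. Policy evaluation repeatedly applies the soft Bellman backup:

$$
\mathcal{T}^{\pi} Q\left(s_{t}, a_{t}\right)=r\left(s_{t}, a_{t}\right)+\gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V\left(s_{t+1}\right)\right], \quad V\left(s_{t}\right)=\mathbb{E}_{a_{t} \sim \pi}\left[Q\left(s_{t}, a_{t}\right)-\log \pi\left(a_{t} \mid s_{t}\right)\right]
$$

Policy improvement projects the exponentiated soft Q-function back onto the policy class $\Pi$:

$$
\pi_{\text{new}}=\arg \min _{\pi^{\prime} \in \Pi} D_{\mathrm{KL}}\left(\pi^{\prime}\left(\cdot \mid s_{t}\right) \,\Big\|\, \frac{\exp \left(Q^{\pi_{\text{old}}}\left(s_{t}, \cdot\right)\right)}{Z^{\pi_{\text{old}}}\left(s_{t}\right)}\right)
$$

The partition function $Z^{\pi_{\text{old}}}$ does not depend on $\pi^{\prime}$, so it drops out of the gradient.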

Algo

An actor-critic framework consisting of current and target soft value networks $V_{\psi}\left(s_{t}\right)$ and $V_{\bar{\psi}}\left(s_{t}\right)$, double Q-value networks $Q_{\theta_{1,2}}\left(s_{t},a_{t}\right)$, and an actor network $\pi_{\phi}\left(a_{t}\mid s_{t}\right)$.
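
To show how these networks interact, here is a minimal sketch of the regression targets and losses on plain Python floats. It follows the update rules as described in the SAC paper, but the function names, the `gamma` and `reward_scale` defaults, and the use of scalar lists in place of neural networks are assumptions made for illustration, not the authors' implementation.

```python
def value_target(q1, q2, log_prob):
    """Target for V_psi(s_t): min of the two Q-values minus log pi(a_t|s_t).

    The action is assumed to be freshly sampled from the current policy.
    """
    return min(q1, q2) - log_prob


def q_target(reward, next_v_bar, done, gamma=0.99, reward_scale=1.0):
    """Target for both Q networks, bootstrapped from the target network V_psi_bar(s_{t+1})."""
    not_done = 0.0 if done else 1.0
    return reward_scale * reward + gamma * not_done * next_v_bar


def mse_loss(preds, targets):
    """0.5 * mean squared error, used for the value and Q regressions."""
    errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
    return 0.5 * sum(errors) / len(errors)


def policy_loss(log_probs, q_values):
    """Actor loss: mean of log pi(a|s) - Q(s, a) over reparameterized actions."""
    terms = [lp - q for lp, q in zip(log_probs, q_values)]
    return sum(terms) / len(terms)
```

In a real implementation the same expressions would be evaluated on batched tensors, with gradients flowing through the reparameterized action in `policy_loss`.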

Env & Baselines

Gym, Humanoid (rllab) [continuous tasks]
DDPG, PPO, SQL, TD3, Trust-PCL

Exp

Comparative evaluation against DDPG, PPO, SQL, and TD3 (Gym, Humanoid rllab)
Ablation Study
- Stochastic vs deterministic policy (remove entropy) [Humanoid rllab] $\to$ Stochasticity can stabilize training.
- Policy evaluation [Gym Ant-v1] $\to$ Deterministic evaluation can yield better performance.
- Reward scale hyperparameter [Gym Ant-v1] $\to$ With the right reward scaling, the model balances exploration and exploitation, leading to faster learning and better asymptotic performance.
- Target network update [Gym Ant-v1] $\to$ Large $\tau$ can lead to instabilities while small $\tau$ can make training slower.
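
Two of the ablated knobs are easy to illustrate in code: the soft (exponential moving average) target update controlled by $\tau$, and deterministic evaluation, which acts on the mean of the policy instead of a sample. The sketch below is a minimal, assumed version: the function names, the flat list of parameters, the one-dimensional tanh-squashed Gaussian policy, and the default `tau` are illustrative choices.

```python
import math
import random


def soft_update(target_params, online_params, tau=0.005):
    """Move target weights toward online weights: psi_bar <- tau * psi + (1 - tau) * psi_bar."""
    return [
        tau * w + (1.0 - tau) * w_bar
        for w_bar, w in zip(target_params, online_params)
    ]


def select_action(mean, std, deterministic=False, rng=random):
    """Pick an action from a 1-D tanh-squashed Gaussian policy.

    deterministic=True uses the mean (evaluation);
    otherwise a sample is drawn (training-time exploration).
    """
    pre_tanh = mean if deterministic else mean + std * rng.gauss(0.0, 1.0)
    return math.tanh(pre_tanh)
```

With `tau=1.0` the target is overwritten by the online weights at every step (a hard update), while a very small `tau` makes the target track the online network slowly, which matches the trade-off noted in the last bullet.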

https://jasonzhujp.github.io/2023/04/10/paper-rl-06/

Author

Jason Zhu

Published on

April 10, 2023
