Relative Entropy Regularized Sample Efficient Reinforcement Learning with Continuous Actions
  • Zhiwei Shang
  • Renxing Li
  • Chunhua Zheng
  • Huiyun Li
  • Yunduan Cui
Yunduan Cui
Shenzhen Institutes of Advanced Technology

Corresponding Author:[email protected]

Abstract

In this paper, a novel reinforcement learning (RL) approach, continuous dynamic policy programming (CDPP), is proposed to address both learning stability and sample efficiency in current RL methods with continuous actions.
The proposed method naturally extends relative entropy regularization from the value function-based framework to the actor-critic (AC) framework of deep deterministic policy gradient (DDPG) to stabilize learning in continuous action spaces. It handles the intractable softmax operation over continuous actions in the critic by Monte Carlo estimation and explores the practical advantages of the Mellowmax operator. A Boltzmann sampling policy is proposed to guide the actor's exploration following the relative entropy regularized critic.
Evaluated on several benchmark tasks, the proposed method clearly illustrates the positive impact of relative entropy regularization, including efficient exploration behavior and stable policy updates in RL with continuous action spaces, and it outperforms the related baseline approach in both sample efficiency and learning stability.
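
The two operations named in the abstract, a Monte Carlo estimate of a soft maximum over continuous actions and Boltzmann sampling guided by the critic, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The critic `toy_q`, the uniform sampling of candidate actions, the sample count, and the temperature values `omega` and `eta` are all assumptions chosen for illustration. The Mellowmax form used here is the standard log-mean-exp definition.

```python
import numpy as np

def mellowmax(values, omega):
    """Mellowmax: (1/omega) * log(mean(exp(omega * values))), computed stably."""
    values = np.asarray(values, dtype=float)
    shift = values.max()  # subtract the max before exponentiating to avoid overflow
    return shift + np.log(np.mean(np.exp(omega * (values - shift)))) / omega

def boltzmann_sample(actions, values, eta, rng):
    """Pick one candidate action with probability proportional to exp(eta * value)."""
    values = np.asarray(values, dtype=float)
    logits = eta * (values - values.max())
    probs = np.exp(logits)
    probs /= probs.sum()
    index = rng.choice(len(actions), p=probs)
    return actions[index], probs

def toy_q(action):
    # Hypothetical critic for a single fixed state (assumption): peaks at action = 0.3.
    return -(action - 0.3) ** 2

rng = np.random.default_rng(0)
# Monte Carlo step: uniform samples stand in for the integral over a 1-D action range.
candidates = rng.uniform(-1.0, 1.0, size=256)
q_values = toy_q(candidates)

soft_value = mellowmax(q_values, omega=10.0)  # soft maximum over sampled actions
action, probs = boltzmann_sample(candidates, q_values, eta=10.0, rng=rng)
```

With uniformly sampled actions, the Mellowmax of the sampled critic values always lies between their mean and their maximum, and it moves toward the maximum as `omega` grows. The Boltzmann step concentrates exploration on high-value actions while still giving every candidate a nonzero probability.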