Relative Entropy Regularized Sample Efficient Reinforcement Learning
with Continuous Actions
Abstract
In this paper, a novel reinforcement learning (RL) approach, continuous
dynamic policy programming (CDPP), is proposed to tackle the issues of
learning stability and sample efficiency in current RL methods with
continuous actions.
The proposed method naturally extends relative entropy regularization
from the value-function-based framework to the actor-critic (AC)
framework of deep deterministic policy gradient (DDPG) to stabilize the
learning process in continuous action spaces. It tackles the intractable
softmax operation over continuous actions in the critic via Monte Carlo
estimation and explores the practical advantages of the Mellowmax
operator. A Boltzmann sampling policy is proposed to guide the
exploration of the actor following the relative-entropy-regularized
critic.
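As a rough sketch (with notation assumed here rather than taken from the paper), the Mellowmax-style log-average-exp value of the critic over a continuous action space can be estimated by Monte Carlo sampling from a proposal distribution $\mu$, and the actor's exploration can follow a Boltzmann policy derived from the critic:
\[
\operatorname{mm}_{\omega} Q(s,\cdot)
= \frac{1}{\omega}\,\log \mathbb{E}_{a \sim \mu(\cdot\mid s)}\!\left[\exp\big(\omega\, Q(s,a)\big)\right]
\approx \frac{1}{\omega}\,\log \frac{1}{N} \sum_{i=1}^{N} \exp\big(\omega\, Q(s,a_i)\big),
\qquad a_i \sim \mu(\cdot\mid s),
\]
\[
\pi(a \mid s) \propto \exp\big(\omega\, Q(s,a)\big).
\]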
Evaluated on several benchmark tasks, the proposed method clearly
demonstrates the positive impact of relative entropy regularization,
including efficient exploration behavior and stable policy updates in RL
with continuous action spaces, and outperforms the related baseline
approach in both sample efficiency and learning stability.