Yuchi Zhang and Peter Xiaoping Liu
Deep reinforcement learning (DRL), estimation error, actor-critic framework, double critics
The instability and inaccuracy of value estimation in deep reinforcement learning algorithms adversely affect their performance. This paper introduces the Dual Average Twin Delayed Deep Deterministic Policy Gradient (DATD3) algorithm, which addresses the Q-value estimation bias prevalent in value-based reinforcement learning, such as in the Deep Q-Network (DQN). First, by averaging previously learned Q-value estimates from the critic networks, DATD3 constructs a temporal difference target that reduces the variance of the target approximation error and minimises the estimation bias of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. Second, to further enhance performance, DATD3 performs a second averaging over the final Q-values output by the two critic networks, replacing the minimisation operation used in TD3. This approach effectively alleviates the underestimation error present in TD3. The effectiveness of DATD3 is validated on several benchmark robots in the MuJoCo simulation environment, including the bipedal robot (Walker2d), the quadrupedal robot (Ant), the swimming robot (Swimmer), and others. Experimental results demonstrate substantial improvements in stability and performance, showcasing the method's potential for enhancing continuous control tasks in robotics.
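The sketch below illustrates the two averaging steps described in the abstract, assuming a PyTorch actor-critic setup. All names (Critic, push_snapshot, datd3_target, the snapshot lists, K) are illustrative assumptions rather than the paper's reference implementation; the snapshot-based reading of "averaging previously learned Q-value estimates" follows an Averaged-DQN-style interpretation.

```python
# Minimal sketch of a DATD3-style TD target, under the assumptions stated above.
import copy
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Simple Q(s, a) network used only for illustration."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def push_snapshot(snapshots, critic_target, K=5):
    """Record a frozen copy of the current target critic; keep the K latest."""
    snapshots.append(copy.deepcopy(critic_target).eval())
    if len(snapshots) > K:
        snapshots.pop(0)

def datd3_target(snapshots_1, snapshots_2, actor_target,
                 next_state, reward, not_done, gamma=0.99):
    """TD target built from two averaging steps.

    snapshots_1 / snapshots_2 hold the K most recent copies of each target
    critic (previously learned Q-estimates). Each critic's predictions are
    averaged over these snapshots, and the two critics' averages are then
    averaged again, replacing TD3's element-wise minimum.
    """
    with torch.no_grad():
        next_action = actor_target(next_state)
        q1 = torch.stack([c(next_state, next_action) for c in snapshots_1]).mean(0)
        q2 = torch.stack([c(next_state, next_action) for c in snapshots_2]).mean(0)
        q_avg = 0.5 * (q1 + q2)  # mean of the two critics, not their minimum
        return reward + not_done * gamma * q_avg
```

In such a setup, push_snapshot would be called after each target-network update so the snapshot lists track the most recently learned Q-estimates; averaging over them smooths the target, while averaging the two critics avoids the systematic underestimation that the minimum operator can introduce.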