Yuchi Zhang and Peter Xiaoping Liu
Deep reinforcement learning (DRL), estimation error, actor-critic framework, double critics
The instability and inaccuracy of value estimation in deep reinforcement learning algorithms adversely affect their performance. This paper introduces the Dual Average Twin Delayed Deep Deterministic Policy Gradient (DATD3) algorithm, which addresses the Q-value estimation bias prevalent in value-based reinforcement learning, such as in the Deep Q-Network (DQN). First, by averaging previously learned Q-value estimates from the critic networks, DATD3 constructs a temporal difference target that reduces the variance of the target approximation error and minimises the estimation bias of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. Second, to further enhance performance, DATD3 performs a second averaging over the final Q-values output by the two critic networks, replacing the minimisation operation used in TD3. This approach effectively alleviates the underestimation error present in TD3. The effectiveness of DATD3 is validated on several benchmark robots in the MuJoCo simulation environment, including the bipedal robot (Walker2d), the quadrupedal robot (Ant), the swimming robot (Swimmer), and others. Experimental results demonstrate substantial improvements in stability and performance, showcasing the method's potential for enhancing continuous control tasks in robotics.
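The sketch below illustrates the two averaging steps described in the abstract, assuming a PyTorch actor-critic setup. All names (Critic, push_snapshot, datd3_target, the snapshot lists, K) are illustrative assumptions rather than the paper's reference implementation; the snapshot-based reading of "averaging previously learned Q-value estimates" follows an Averaged-DQN-style interpretation.

```python
# Minimal sketch of a DATD3-style TD target, under the assumptions stated above.
import copy
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Simple Q(s, a) network used only for illustration."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def push_snapshot(snapshots, critic_target, K=5):
    """Record a frozen copy of the current target critic; keep the K latest."""
    snapshots.append(copy.deepcopy(critic_target).eval())
    if len(snapshots) > K:
        snapshots.pop(0)

def datd3_target(snapshots_1, snapshots_2, actor_target,
                 next_state, reward, not_done, gamma=0.99):
    """TD target built from two averaging steps.

    snapshots_1 / snapshots_2 hold the K most recent copies of each target
    critic (previously learned Q-estimates). Each critic's predictions are
    averaged over these snapshots, and the two critics' averages are then
    averaged again, replacing TD3's element-wise minimum.
    """
    with torch.no_grad():
        next_action = actor_target(next_state)
        q1 = torch.stack([c(next_state, next_action) for c in snapshots_1]).mean(0)
        q2 = torch.stack([c(next_state, next_action) for c in snapshots_2]).mean(0)
        q_avg = 0.5 * (q1 + q2)  # mean of the two critics, not their minimum
        return reward + not_done * gamma * q_avg
```

In such a setup, push_snapshot would be called after each target-network update so the snapshot lists track the most recently learned Q-estimates; averaging over them smooths the target, while averaging the two critics avoids the systematic underestimation that the minimum operator can introduce.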