Yuchi Zhang and Peter Xiaoping Liu
[1] W. Wang, L. Li, F. Ye, Y. Peng, and Y. Ma, A large-scale path planning algorithm for underwater robots based on deep reinforcement learning, International Journal of Robotics and Automation, 39, 2024, 204–210.
[2] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, Learning agile and dynamic motor skills for legged robots, Science Robotics, 4(26), 2019, eaau5872.
[3] L. Yi, M. Cong, H. Dong, and D. Liu, Reinforcement learning and EGA-based trajectory planning for dual robots, International Journal of Robotics and Automation, 33(4), 2018, 206–5084.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, Playing Atari with deep reinforcement learning, 2013, arXiv:1312.5602.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Human-level control through deep reinforcement learning, Nature, 518(7540), 2015, 529–533.
[6] Y. Wang, Mastering the game of Gomoku without human knowledge, Ph.D. dissertation, California Polytechnic State University, San Luis Obispo, CA, 2018.
[7] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature, 529(7587), 2016, 484–489.
[8] J. Perolat, B. De Vylder, D. Hennes, E. Tarassov, F. Strub, V. de Boer, P. Muller, J.T. Connor, N. Burch, T. Anthony, et al., Mastering the game of Stratego with model-free multiagent reinforcement learning, Science, 378(6623), 2022, 990–996.
[9] D. Zou, L. Lu, W. Zhang, and J. Guo, A navigation method for UUVs under ocean current disturbance based on deep reinforcement learning, in Proceedings of the 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE), Shanghai, 2024, 1165–1168.
[10] B. Li, H. Zhang, and X. Shi, A novel path planning for AUV based on dung beetle optimization algorithm with deep Q-network, International Journal of Robotics and Automation, 40, 2024, 65–73.
[11] C. Watkins, Learning from delayed rewards, Ph.D. dissertation, King's College, London, 1989.
[12] G.A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems, Technical Report 166, Department of Engineering, University of Cambridge, Cambridge, 1994.
[13] R.S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, 3, 1988, 9–44.
[14] R.S. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in Proceedings of Neural Information Processing Systems, Denver, CO, 1999, 1057–1063.
[15] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, Deterministic policy gradient algorithms, in Proceedings of the International Conference on Machine Learning, Beijing, 2014, 387–395.
[16] T.P. Lillicrap, J.J. Hunt, A. Pritzel, N.M.O. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, Continuous control with deep reinforcement learning, 2015, arXiv:1509.02971.
[17] R. Fox, A. Pakman, and N. Tishby, Taming the noise in reinforcement learning via soft updates, 2015, arXiv:1512.08562.
[18] O. Nachum, M. Norouzi, G. Tucker, and D. Schuurmans, Smoothed action value functions for learning Gaussian policies, in Proceedings of the International Conference on Machine Learning, Stockholm, 2018, 3692–3700.
[19] H. van Hasselt, A. Guez, and D. Silver, Deep reinforcement learning with double Q-learning, in Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, 2016, 2094–2100.
[20] O. Anschel, N. Baram, and N. Shimkin, Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning, in Proceedings of the International Conference on Machine Learning, Sydney, NSW, 2017, 176–185.
[21] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in Proceedings of the International Conference on Machine Learning, Stockholm, 2018, 1861–1870.
[22] S. Fujimoto, H. van Hoof, and D. Meger, Addressing function approximation error in actor-critic methods, in Proceedings of the International Conference on Machine Learning, Stockholm, 2018, 1587–1596.
[23] E. Todorov, T. Erez, and Y. Tassa, MuJoCo: A physics engine for model-based control, in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, 2012, 5026–5033.
[24] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, 2017, arXiv:1707.06347.
[25] S. Thrun and A. Schwartz, Issues in using function approximation for reinforcement learning, in Proceedings of the Connectionist Models Summer School, Hillsdale, NJ, 1993, 255–263.
[26] R. Bellman, Dynamic programming, Science, 153(3731), 1966, 34–37.
[27] R.S. Sutton and A.G. Barto, Reinforcement learning: An introduction (Cambridge, MA: MIT Press, 2018).