Onder Tutsoy, Martin Brown, and Hong Wang
Keywords: Temporal difference learning, value function approximation, polynomial basis function, rate of convergence
Reinforcement learning is a method that learns appropriate control actions by maximizing a numerical reward. Since it can solve complex control/learning problems without knowledge of the system dynamics, it has been applied to systems such as humanoid robots and autonomous helicopters. However, the properties of the learned value function, which represents the long-term performance of learning, have not been examined on a simple system. In this paper, a simple first-order unstable plant with a piecewise linear control is introduced to enable an explicit parameter convergence analysis of the value function. A closed-form solution for the value function is determined, consisting of optimal parameters and an optimal polynomial basis. It is shown that a number of parameters arise that are functions of the plant parameters and the value function discount factor. It is also proved that the temporal difference error introduces an almost null space up to the cut-off point of the piecewise linear control. Moreover, it is shown that the residual gradient algorithm converges faster than TD(0).
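To make the comparison concrete, the following is a minimal sketch (in Python/NumPy, not the authors' implementation) of the TD(0) and residual gradient updates with a polynomial basis on a first-order unstable plant under a saturated, piecewise linear control. The plant parameters, quadratic reward, step size, and cut-off point are illustrative assumptions; the key difference shown is that residual gradient descends the true gradient of the squared TD error, while TD(0) follows the semi-gradient.

```python
# Minimal sketch (assumed setup, not the paper's exact system): TD(0) vs.
# residual gradient on a first-order unstable plant x' = a*x + b*u with a
# piecewise linear (saturated) control and a polynomial value function basis.
import numpy as np

a, b = 1.2, 1.0          # unstable plant: |a| > 1 (assumed values)
K, cutoff = 0.5, 1.0     # linear gain and saturation cut-off (assumed)
gamma = 0.9              # value function discount factor
order = 3                # polynomial basis: phi(x) = [1, x, ..., x^order]

def control(x):
    # Piecewise linear control: linear for |x| <= cutoff, saturated outside.
    return -K * np.clip(x, -cutoff, cutoff)

def step(x):
    # One plant transition; quadratic cost taken as negative reward (assumed).
    u = control(x)
    return a * x + b * u, -(x ** 2)

def phi(x):
    # Polynomial basis features.
    return np.array([x ** i for i in range(order + 1)])

def train(residual_gradient, alpha=0.01, episodes=200, horizon=30, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(order + 1)
    for _ in range(episodes):
        x = rng.uniform(-0.5, 0.5)   # start inside the linear region
        for _ in range(horizon):
            x_next, r = step(x)
            f, f_next = phi(x), phi(x_next)
            delta = r + gamma * w @ f_next - w @ f   # TD error
            if residual_gradient:
                # Residual gradient: descend on the squared TD error.
                w += alpha * delta * (f - gamma * f_next)
            else:
                # TD(0): semi-gradient update.
                w += alpha * delta * f
            x = x_next
    return w

for name, rg in [("TD(0)", False), ("residual gradient", True)]:
    w = train(rg)
    # Report the mean squared TD error over a grid of states.
    xs = np.linspace(-0.5, 0.5, 101)
    deltas = []
    for x in xs:
        x_next, r = step(x)
        deltas.append(r + gamma * phi(x_next) @ w - phi(x) @ w)
    print(f"{name}: weights={np.round(w, 3)}, MSTDE={np.mean(np.square(deltas)):.4f}")
```

With the assumed gain K, the closed loop inside the linear region contracts (a - b*K = 0.7), so trajectories started there never reach the cut-off; this keeps the sketch well behaved while still exercising the piecewise control law.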