The method uses an actor-critic architecture, in which a recursive least-squares TD method estimates the parameters of the value function during critic training and a value gradient method improves the control policy during actor training.
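- The method adopts an actor-critic structure: critic training uses the recursive least-squares TD (RLS-TD) method to estimate the value-function parameters, and actor training uses a value-gradient descent method to improve the control policy.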
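
The critic step named in the sentence can be illustrated with a minimal sketch of a textbook RLS-TD(0) update for a linear value function. This is not necessarily the exact variant the quoted method uses. The names `rls_td_update` and `demo`, the two-state toy chain, the tabular features, the discount factor 0.5, and the initial covariance scale 1e3 are all assumptions made for illustration. The actor's value-gradient step is omitted because the sentence does not specify it.

```python
import numpy as np


def rls_td_update(theta, P, phi, phi_next, reward, gamma):
    """One RLS-TD(0) step for a linear value function V(s) = theta @ phi(s).

    Hypothetical helper: returns the updated parameter vector and the
    updated inverse-correlation matrix P.
    """
    d = phi - gamma * phi_next            # temporal-difference feature vector
    P_phi = P @ phi
    gain = P_phi / (1.0 + d @ P_phi)      # Sherman-Morrison gain vector
    theta = theta + gain * (reward - d @ theta)
    P = P - np.outer(gain, d @ P)         # rank-one update of P
    return theta, P


def demo():
    """Toy chain (assumed): state 0 -> 1 with reward 1, state 1 -> 0 with reward 0."""
    gamma = 0.5
    features = np.eye(2)                  # tabular (one-hot) features
    theta = np.zeros(2)
    P = 1e3 * np.eye(2)                   # large initial P means weak regularization
    state = 0
    for _ in range(200):
        next_state = 1 - state
        reward = 1.0 if state == 0 else 0.0
        theta, P = rls_td_update(
            theta, P, features[state], features[next_state], reward, gamma
        )
        state = next_state
    return theta


if __name__ == "__main__":
    print(demo())
```

For this toy chain the exact values satisfy V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0), so V(0) = 4/3 and V(1) = 2/3, and the estimate should land close to those numbers. Unlike gradient-based TD, the recursive least-squares form needs no step size, at the cost of maintaining the matrix `P`.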