Abstract
We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the inter-predictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.
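As an illustrative sketch of the chained-prediction idea described in the abstract (not the paper's exact algorithm; the state space, step size, and variable names here are invented for illustration), the following toy example trains a small TD network on a five-state random walk. Prediction `y[0]` estimates the next observation bit, and each `y[i]` targets `y[i-1]` one step later, so `y[i]` estimates the observation `i+1` steps ahead — "predicting by a fixed interval".

```python
import random

N_STATES = 5    # random-walk states 0..4; the observation bit is 1 only on entering state 4
CHAIN_LEN = 3   # number of chained predictions (illustrative choice)
ALPHA = 0.02    # constant step size (illustrative choice)

# one table of predictions per chain level, indexed by state
y = [[0.5] * N_STATES for _ in range(CHAIN_LEN)]

def step(s):
    """One random-walk transition with reflecting boundaries."""
    s2 = min(N_STATES - 1, max(0, s + random.choice((-1, 1))))
    obs = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, obs

random.seed(0)
s = 2
for _ in range(200_000):
    s2, obs = step(s)
    # TD-network-style update: each prediction's target is the next
    # level down, evaluated at the next state; the bottom level
    # targets the observation itself.
    targets = [obs] + [y[i - 1][s2] for i in range(1, CHAIN_LEN)]
    for i in range(CHAIN_LEN):
        y[i][s] += ALPHA * (targets[i] - y[i][s])
    s = s2
```

After training, `y[0][3]` approaches 0.5 (from state 3 the walk reaches state 4 half the time), and `y[1][2]` approaches 0.25 (the two-step probability from state 2), showing how each chained prediction grounds out in the observation several steps ahead.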
Richard S. Sutton, Brian Tanner. "Temporal-difference networks." In NIPS'04: Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada. Cambridge, MA: MIT Press, 2004: 1377–1384.