JOURNAL OF SOUTHWEST JIAOTONG UNIVERSITY (西南交通大学学报)
2009, No. 6, pp. 877-881 (5 pages)
Keywords: reinforcement learning; data-driven; Q-learning; uncertainty
Abstract: In a dynamic environment, reinforcement learning has difficulty balancing the exploration of untested actions against the exploitation of known optimal actions. To address this problem, a data-driven Q-learning algorithm is proposed. The algorithm first constructs a behavior information system for the agent; the uncertainty of the knowledge in this system is then used to build an environment-triggered mechanism that traces environmental change. Driven by this dynamic information, the trigger mechanism adaptively controls exploration of the new environment, balancing the exploration of untested actions and the exploitation of known optimal actions. Simulation results on maze (grid-world) navigation tasks in dynamic environments show that the proposed algorithm shortens the average number of steps to the goal by 7.79% to 84.7% compared with the standard Q-learning, simulated annealing Q-learning (SAQ), and recency-based exploration (RBE) Q-learning algorithms.
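The general idea described in the abstract can be sketched as tabular Q-learning in which the exploration rate is driven by an uncertainty signal rather than a fixed schedule. The sketch below is an illustrative assumption, not the paper's actual formulation: here "uncertainty" is estimated from recent temporal-difference errors (large errors suggest the environment has changed and raise exploration; small errors let it decay), and the grid world, constants, and all function names are hypothetical.

```python
import random

# Illustrative 5x5 grid world: start at (0, 0), goal at (4, 4),
# step cost -1, goal reward +10. Transitions are deterministic.
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up
SIZE, GOAL = 5, (4, 4)

def step(state, action):
    """Move within grid bounds; return (next_state, reward, done)."""
    x = min(max(state[0] + action[0], 0), SIZE - 1)
    y = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (x, y)
    return nxt, (10.0 if nxt == GOAL else -1.0), nxt == GOAL

def train(episodes=300, alpha=0.5, gamma=0.95, seed=0):
    random.seed(seed)
    Q = {}                # (state, action_index) -> value
    uncertainty = 1.0     # exploration driver; decays as knowledge stabilizes
    for _ in range(episodes):
        state, done, steps = (0, 0), False, 0
        while not done and steps < 100:
            # Trigger mechanism (assumed form): explore with probability
            # equal to the current uncertainty, otherwise act greedily.
            if random.random() < uncertainty:
                a = random.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)),
                        key=lambda i: Q.get((state, i), 0.0))
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = max(Q.get((nxt, i), 0.0) for i in range(len(ACTIONS)))
            td_error = reward + gamma * best_next * (not done) \
                       - Q.get((state, a), 0.0)
            Q[(state, a)] = Q.get((state, a), 0.0) + alpha * td_error
            # Large TD errors signal an unfamiliar or changed environment
            # and keep uncertainty high; small errors let it decay.
            uncertainty = min(1.0, 0.99 * uncertainty
                              + 0.01 * min(abs(td_error), 1.0))
            state, steps = nxt, steps + 1
    return Q

def greedy_path_length(Q, max_steps=50):
    """Follow the greedy policy from the start; return steps to the goal."""
    state, steps = (0, 0), 0
    while state != GOAL and steps < max_steps:
        a = max(range(len(ACTIONS)), key=lambda i: Q.get((state, i), 0.0))
        state, _, _ = step(state, ACTIONS[a])
        steps += 1
    return steps

if __name__ == "__main__":
    Q = train()
    print(greedy_path_length(Q))  # shortest path on this grid is 8 steps
```

In a dynamic environment, moving the walls or the goal would produce a burst of large TD errors, pushing the uncertainty (and hence exploration) back up automatically, which is the balance the abstract describes; the specific decay constants here are arbitrary choices for the sketch.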