通信学报
JOURNAL OF CHINA INSTITUTE OF COMMUNICATIONS
2013
Issue 11
pp. 129-139
11 pages in total
于俊 (Yu Jun)%刘全 (Liu Quan)%傅启明 (Fu Qiming)%孙洪坤 (Sun Hongkun)%陈桂兴 (Chen Guixing)
reinforcement learning%Markov decision process%prioritized sweeping%Dyna architecture%Bayesian Q learning
Bayesian Q-learning uses probability distributions to describe the uncertainty of Q values and selects actions according to these distributions, so as to balance exploration and exploitation. However, Bayesian Q-learning suffers from slow convergence and low convergence accuracy. To address these problems, a Bayesian Q-learning method based on the Dyna architecture with prioritized sweeping, called Dyna-PS-BayesQL, is proposed. The method consists of two parts. In the learning part, it models the state transition function and reward function of the environment from collected samples and updates the parameters of the action-value function by Bayesian Q-learning. In the planning part, it applies prioritized sweeping and dynamic programming to the learned model to perform planning updates of the action-value function, which makes better use of historical experience and thereby improves convergence speed and convergence accuracy. Dyna-PS-BayesQL is applied to the chain problem and the maze navigation problem. The experimental results show that the method balances exploration and exploitation well and achieves better convergence speed and convergence accuracy.
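The two-part structure described in the abstract (a learning part that fits a model from real samples, and a planning part that replays that model in priority order) can be illustrated with a minimal sketch. This is not the authors' Dyna-PS-BayesQL: it substitutes ordinary tabular Q-learning with epsilon-greedy exploration for the Bayesian Q-value distributions, assumes a deterministic environment, and follows the textbook prioritized-sweeping loop. The names `chain_step`, `dyna_ps`, the 5-state chain, and all parameter values are hypothetical choices made for illustration.

```python
import heapq
import random
from collections import defaultdict

GOAL = 4  # hypothetical 5-state chain: states 0..4, state 4 is terminal


def chain_step(s, a):
    """Deterministic toy chain: action 1 moves right, action 0 moves left."""
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    done = s2 == GOAL
    return (1.0 if done else 0.0), s2, done


def dyna_ps(step, n_states, n_actions, start=0, episodes=30, max_steps=50,
            gamma=0.9, alpha=1.0, theta=1e-4, n_planning=10, epsilon=0.1,
            seed=0):
    """Tabular Dyna with prioritized sweeping (point-estimate Q, not Bayesian)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}                       # (s, a) -> (reward, next_state, done)
    predecessors = defaultdict(set)  # s' -> {(s, a)} observed to lead to s'
    queue = []                       # max-priority queue via negated priorities

    def target(r, s2, done):
        return r if done else r + gamma * max(Q[s2])

    def push_if_needed(s, a):
        # Priority is the size of the pending one-step backup for (s, a).
        r, s2, done = model[(s, a)]
        p = abs(target(r, s2, done) - Q[s][a])
        if p > theta:
            heapq.heappush(queue, (-p, s, a))

    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([i for i in range(n_actions) if Q[s][i] == best])
            r, s2, done = step(s, a)

            # Learning part: record the observed transition in the model.
            model[(s, a)] = (r, s2, done)
            predecessors[s2].add((s, a))
            push_if_needed(s, a)

            # Planning part: back up the highest-priority pairs from the model.
            for _ in range(n_planning):
                if not queue:
                    break
                _, ps, pa = heapq.heappop(queue)
                pr, ps2, pdone = model[(ps, pa)]
                Q[ps][pa] += alpha * (target(pr, ps2, pdone) - Q[ps][pa])
                for qs, qa in predecessors[ps]:
                    push_if_needed(qs, qa)

            if done:
                break
            s = s2
    return Q


Q = dyna_ps(chain_step, n_states=5, n_actions=2)
greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(GOAL)]
```

In this sketch a single rewarded transition is propagated backward through all recorded predecessors within a few planning steps, which is the mechanism the abstract credits for faster convergence. A version closer to the paper would replace the point estimates in `Q` with per-pair distributions and select actions from them.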