计算机科学与探索
JOURNAL OF FRONTIERS OF COMPUTER SCIENCE & TECHNOLOGY
2014
Issue 5
pp. 630-639
10 pages
江雨燕%李平%王清%李常训
隐藏狄利克雷分配(LDA)%监督主题模型%文档聚类%作者预测
latent Dirichlet allocation (LDA)%supervised topic model%document clustering%author prediction
当前监督或半监督隐藏狄利克雷分配(latent Dirichlet allocation,LDA)模型多数采用DSTM(down-stream supervised topic model)或USTM(upstream supervised topic model)方式加入额外信息,使得模型具有较高的主题提取和数据降维能力,然而无法处理包含多种额外信息的学术文档数据。通过对LDA及其扩展模型的研究,提出了一种将DSTM和USTM结合的概率主题模型ART(author & reference topic)。ART模型分别以USTM和DSTM方式构建了文档作者和引用文献的生成过程,因此可以对既包含作者信息又包含引用文献信息的文档进行有效的分析处理。在实验过程中采用Stochastic EM Sampling 方法对模型参数进行了学习,并将实验结果与Labeled LDA和DMR模型进行了对比。实验结果表明,ART模型不仅拥有高效的文档主题提取和聚类能力,同时还拥有优良的文档作者判别和引用文献排序能力。
Most supervised and semi-supervised latent Dirichlet allocation (LDA) models incorporate metadata using either the DSTM (downstream supervised topic model) or the USTM (upstream supervised topic model) approach, which gives them strong topic-extraction and dimension-reduction capabilities. However, these models cannot handle academic documents that carry more than one kind of metadata. Building on a study of LDA and its extensions, this paper proposes the author & reference topic (ART) model, a probabilistic topic model that combines DSTM and USTM. ART models the generation of document authors in the USTM manner and the generation of cited references in the DSTM manner, so it can analyze documents that contain both author and reference information. In the experiments, the Stochastic EM Sampling method is used to learn the model parameters, and ART is compared with the Labeled LDA and DMR models. The results show that ART not only extracts topics and clusters documents effectively, but also performs well at predicting the authors of a new document and at ranking cited references.