计算机技术与发展
COMPUTER TECHNOLOGY AND DEVELOPMENT
2014
Issue 6
pp. 140-144 (5 pages)
PMI-IR (pointwise mutual information method)%associative thesaurus%query logs
Improving retrieval accuracy by mining and analyzing large-scale query logs has long been a hot topic in information retrieval. This paper proposes a method for constructing an associative thesaurus based on PMI-IR (pointwise mutual information). The method uses a sequential pattern mining algorithm (PrefixSpan) to scan large-scale user query logs and extract the word combinations whose co-occurrence frequency exceeds a given threshold, then clusters them to obtain candidate synonym sets. Next, the similarity between a word wordA and each candidate word is computed in turn, and the candidates whose similarity exceeds a given threshold are selected as the associated-word set of wordA; these sets together form the thesaurus. Experiments show that expanding queries with the resulting thesaurus improves retrieval accuracy.
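The co-occurrence filtering and similarity thresholding described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes whitespace-tokenized queries, replaces the PrefixSpan mining and clustering steps with plain pair counting, and uses the standard pointwise mutual information formula as the similarity measure, since the abstract does not state the exact formula. The function name `build_thesaurus` and the parameters `min_cooccur` and `min_pmi` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_thesaurus(queries, min_cooccur=2, min_pmi=1.0):
    """Map each word to (associated word, PMI) pairs; illustrative only.

    Assumption: probabilities are estimated as the fraction of queries
    that contain a word (or a pair of words).
    """
    n = len(queries)
    word_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        # Count each word once per query; sorting gives a canonical pair order.
        words = sorted(set(query.split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    thesaurus = {}
    for (a, b), n_ab in pair_counts.items():
        # First threshold: drop pairs that co-occur too rarely.
        if n_ab < min_cooccur:
            continue
        # PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) )
        pmi = math.log2(n_ab * n / (word_counts[a] * word_counts[b]))
        # Second threshold: keep only sufficiently similar words.
        if pmi >= min_pmi:
            thesaurus.setdefault(a, []).append((b, pmi))
            thesaurus.setdefault(b, []).append((a, pmi))

    # Rank each word's associated words by descending PMI.
    for word in thesaurus:
        thesaurus[word].sort(key=lambda item: -item[1])
    return thesaurus


sample_log = [
    "laptop notebook",
    "laptop notebook price",
    "laptop price",
    "cheap flights",
    "cheap flights paris",
    "weather today",
]
thesaurus = build_thesaurus(sample_log)
```

In this toy log, "cheap" and "flights" co-occur in two of six queries and each appears in only two, so the pair scores well above the threshold, while words seen in a single query never pass the co-occurrence filter. Both thresholds would need tuning on a real log.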