微型机与应用
MICROCOMPUTER & ITS APPLICATIONS
2011
Issue 18
pp. 62-64
3 pages in total
Lucene%hash%whole-word binary search (binary-seek-by-word)%maximum matching
To address the poor segmentation quality of the Chinese tokenizers bundled with Lucene, this paper analyzes existing segmentation dictionary mechanisms and designs a tokenizer based on an all-hash, whole-word binary search algorithm, which is then integrated into Lucene. The algorithm hashes whole words, which reduces the number of dictionary entry comparisons and improves segmentation efficiency. The tokenizer's dictionary file is easy to maintain and can be customized to the requirements of different applications, which in turn improves retrieval efficiency.
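
The dictionary lookup and maximum matching named in the keywords can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the dictionary layout, so the bucket key used here (first character plus word length), the class name `HashedDictionary`, and the function `segment` are all assumptions made for illustration. Each candidate is looked up as a whole word: a hash lookup selects a small sorted bucket, and a binary search inside that bucket confirms the match.

```python
from bisect import bisect_left
from collections import defaultdict


class HashedDictionary:
    """Hypothetical layout: (first character, word length) -> sorted word list."""

    def __init__(self, words):
        words = list(words)
        buckets = defaultdict(set)
        for word in words:
            if word:
                buckets[(word[0], len(word))].add(word)
        # Sort each bucket once so lookups can use binary search.
        self.buckets = {key: sorted(group) for key, group in buckets.items()}
        self.max_len = max((len(word) for word in words), default=1)

    def contains(self, word):
        """Hash to the bucket, then binary-search the whole word inside it."""
        if not word:
            return False
        bucket = self.buckets.get((word[0], len(word)))
        if not bucket:
            return False
        index = bisect_left(bucket, word)
        return index < len(bucket) and bucket[index] == word


def segment(text, dictionary):
    """Forward maximum matching: at each position, take the longest dictionary word."""
    tokens = []
    pos = 0
    while pos < len(text):
        longest = min(dictionary.max_len, len(text) - pos)
        for length in range(longest, 1, -1):
            candidate = text[pos:pos + length]
            if dictionary.contains(candidate):
                tokens.append(candidate)
                pos += length
                break
        else:
            # No multi-character word starts here; emit a single character.
            tokens.append(text[pos])
            pos += 1
    return tokens
```

With a dictionary built from the words 中文, 分词, 分词器, and 中文分词, calling `segment("中文分词器很好", dictionary)` tries the four-character candidate first and accepts it, so the remaining characters fall back to single-character tokens. Customizing the tokenizer for an application would then amount to changing the word list passed to `HashedDictionary`.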