信息技术与信息化
信息技術與信息化
신식기술여신식화
INFORMATION TECHNOLOGY & INFORMATIZATION
2013
Issue 4
pp. 32-34, 40
4 pages in total
记录排序%重复记录清理%重复记录识别
記錄排序%重複記錄清理%重複記錄識彆
기록배서%중복기록청리%중복기록식별
Record sorting%Duplicate record elimination%Duplicate record detection
本文改进了重复记录清理算法中所存在的缺陷。改进后的算法,有较好的记录的匹配率保证,而且显著提升了记录排序的效率;在重复记录识别时,考虑了以下五个因素:匹配字段的文字数量、在二个字段中出现的频率、在记录中各字段的权重、中文字段的语义以及语义重点偏后;合并重复记录时采用的策略是聚类和实用算法并用,算法的准确性和健壮性得到了提高。
本文改進瞭重複記錄清理算法中所存在的缺陷。改進後的算法,有較好的記錄的匹配率保證,而且顯著提升瞭記錄排序的效率;在重複記錄識彆時,攷慮瞭以下五箇因素:匹配字段的文字數量、在二箇字段中齣現的頻率、在記錄中各字段的權重、中文字段的語義以及語義重點偏後;閤併重複記錄時採用的策略是聚類和實用算法併用,算法的準確性和健壯性得到瞭提高。
본문개진료중복기록청리산법중소존재적결함。개진후적산법,유교호적기록적필배솔보증,이차현저제승료기록배서적효솔;재중복기록식별시,고필료이하오개인소:필배자단적문자수량、재이개자단중출현적빈솔、재기록중각자단적권중、중문자단적어의이급어의중점편후;합병중복기록시채용적책략시취류화실용산법병용,산법적준학성화건장성득도료제고。
This paper addresses shortcomings in existing duplicate record elimination algorithms. The improved algorithm maintains a good record matching rate while significantly improving the efficiency of record sorting. When detecting duplicate records, it takes five factors into account: the number of characters in the matched fields, the frequency with which characters occur in the two fields, the weight of each field within a record, the semantics of Chinese fields, and the tendency of the semantic focus to fall toward the end of a field. When merging duplicate records, it combines a clustering algorithm with a practical algorithm, which improves the accuracy and robustness of the data cleaning algorithm.
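The abstract does not give the paper's formulas, so the following is only an illustrative sketch of how three of the listed factors (shared characters between two fields, per-field weights, and a semantic focus toward the end of a field) might be combined into one similarity score. The function names, the linear position weighting, and the example field weights are all assumptions made for this illustration and are not taken from the paper. Chinese semantics is not modeled here.

```python
from collections import Counter


def _directed_overlap(x: str, y: str) -> float:
    """Share of x's position-weighted characters that also occur in y.

    Assumption: the character at index i gets weight i + 1, so later
    characters count more (a stand-in for "semantic focus toward the end").
    """
    remaining = Counter(y)  # how many times each character may still be matched
    total = matched = 0.0
    for i, ch in enumerate(x):
        weight = i + 1
        total += weight
        if remaining[ch] > 0:
            remaining[ch] -= 1
            matched += weight
    return matched / total


def field_similarity(a: str, b: str) -> float:
    """Symmetric similarity of two field values, in [0, 1]."""
    if not a or not b:
        return 0.0
    return (_directed_overlap(a, b) + _directed_overlap(b, a)) / 2


def record_similarity(r1: dict, r2: dict, field_weights: dict) -> float:
    """Weighted average of field similarities (the weights are hypothetical)."""
    total_weight = sum(field_weights.values())
    score = sum(
        w * field_similarity(r1.get(name, ""), r2.get(name, ""))
        for name, w in field_weights.items()
    )
    return score / total_weight


if __name__ == "__main__":
    a = {"name": "北京大学", "city": "北京"}
    b = {"name": "南京大学", "city": "南京"}
    print(record_similarity(a, b, {"name": 2, "city": 1}))
```

In a sketch like this, a pair of records would be treated as candidate duplicates when the score exceeds some threshold. The abstract does not state a threshold, so one would have to be chosen empirically.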