数字图书馆论坛
數字圖書館論壇
수자도서관론단
Digital Library Forum
2015
Issue 10
14-20
7 pages in total
盛怡瑾%张学福%孙巍%郝心宁
盛怡瑾%張學福%孫巍%郝心寧
성이근%장학복%손외%학심저
数据清洗%数据匹配%期刊%作者%机构
數據清洗%數據匹配%期刊%作者%機構
수거청세%수거필배%기간%작자%궤구
Data Cleansing%Data Matching%Journals%Author%Institution
为了评价数据匹配算法中常用的四种字段匹配算法——Smith-Waterman算法、编辑距离(Edit Distance)、Q-gram算法和Jaro-Winkler算法的效果和表现,本文选取由水稻领域18个重点期刊集成得到的作者和机构数据设计实验,使用Febrl清洗工具包对相似重复记录进行匹配。结果表明,四种算法适用条件不同, Smith-Waterman算法运行时间特别长,但综合表现以及精度和召回率都不错;编辑距离(Edit Distance)性价比比较高;Q-gram算法运算快但召回率低;Jaro-Winkler算法在此例中表现比较差。
為了評價數據匹配算法中常用的四種字段匹配算法——Smith-Waterman算法、編輯距離(Edit Distance)、Q-gram算法和Jaro-Winkler算法的效果和表現,本文選取由水稻領域18個重點期刊集成得到的作者和機構數據設計實驗,使用Febrl清洗工具包對相似重複記錄進行匹配。結果表明,四種算法適用條件不同, Smith-Waterman算法運行時間特別長,但綜合表現以及精度和召回率都不錯;編輯距離(Edit Distance)性價比比較高;Q-gram算法運算快但召回率低;Jaro-Winkler算法在此例中表現比較差。
위료평개수거필배산법중상용적사충자단필배산법——Smith-Waterman산법、편집거리(Edit Distance)、Q-gram산법화Jaro-Winkler산법적효과화표현,본문선취유수도영역18개중점기간집성득도적작자화궤구수거설계실험,사용Febrl청세공구포대상사중복기록진행필배。결과표명,사충산법괄용조건불동, Smith-Waterman산법운행시간특별장,단종합표현이급정도화소회솔도불착;편집거리(Edit Distance)성개비비교고;Q-gram산법운산쾌단소회솔저;Jaro-Winkler산법재차례중표현비교차。
To evaluate the effectiveness and performance of four field-matching algorithms commonly used in data matching (the Smith-Waterman algorithm, Edit Distance, the Q-gram algorithm, and the Jaro-Winkler algorithm), we designed experiments on author and institution data integrated from 18 key journals in the rice research field, using the Febrl data cleaning toolkit to match approximately duplicate records. The results show that the four algorithms suit different conditions. The Smith-Waterman algorithm has a particularly long running time, but its overall performance, precision, and recall are all good. Edit Distance offers a relatively good balance of cost and performance. The Q-gram algorithm runs fast but has low recall. The Jaro-Winkler algorithm performs relatively poorly in this case.
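For readers unfamiliar with the four field-matching measures named in the abstract, the following is a minimal, self-contained sketch of each in plain Python. It is not Febrl's implementation and not the authors' experimental code. The function names, the bigram size `q=2`, the Dice-style normalisation for q-grams, the Smith-Waterman scoring values (`match=2`, `mismatch=-1`, `gap=-1`), and the sample institution strings are all assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over unpadded q-gram multisets (assumed variant)."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2.0 * sum((grams_a & grams_b).values()) / total


def jaro(a: str, b: str) -> float:
    """Jaro similarity: matches within a window, minus transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    used_a, used_b = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not used_b[j] and b[j] == ca:
                used_a[i] = used_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq_a = [c for c, u in zip(a, used_a) if u]
    seq_b = [c for c, u in zip(b, used_b) if u]
    transpositions = sum(x != y for x, y in zip(seq_a, seq_b)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical institution-name variants, not taken from the study's data.
    x, y = "china national rice research institute", "china natl rice research inst"
    print("edit        ", round(edit_similarity(x, y), 3))
    print("q-gram      ", round(qgram_similarity(x, y), 3))
    print("jaro-winkler", round(jaro_winkler(x, y), 3))
    print("smith-waterman score", smith_waterman_score(x, y))
```

Edit distance, Jaro-Winkler, and Smith-Waterman each compare characters pairwise across the two strings, with Smith-Waterman additionally keeping a full alignment score per cell, while the q-gram measure only counts shared substrings. That difference in work per comparison is in line with the running-time contrast the abstract reports, although actual timings depend on the implementation.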