计算机工程与应用
COMPUTER ENGINEERING AND APPLICATIONS
2014
Issue 19
pp. 123-127
5 pages in total
approximately duplicate records; data cleaning; effective weights; Sorted-Neighborhood Method (SNM)
Heterogeneous database integration produces approximately duplicate records, but their number is limited. Detecting them with the traditional Sorted-Neighborhood Method (SNM) requires comparing all records within the window, which is inefficient. To address this defect, an improved SNM algorithm based on length filtering and effective weights is proposed. Within the window, record pairs whose length ratio rules out approximate duplication are excluded first, which reduces the number of record comparisons and improves detection efficiency. Effective weights are then computed from an attribute validity factor and a weight proportion and used for detection, which improves both recall and precision. Experiments show that the improved algorithm outperforms the SNM algorithm on the performance measures tested.
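The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a validity factor of 1 for an attribute that is non-empty in both records and 0 otherwise, renormalizes the remaining weights to obtain the effective weights, and uses `difflib.SequenceMatcher` as a stand-in field similarity. The function names, the sort key (first attribute), the window size, `min_len_ratio` and `threshold` are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def length_ratio(a, b):
    """Shorter total record length divided by the longer one (1.0 if both are empty)."""
    la = sum(len(v) for v in a)
    lb = sum(len(v) for v in b)
    longest = max(la, lb)
    return 1.0 if longest == 0 else min(la, lb) / longest


def effective_weights(a, b, weights):
    """Assumed rule: drop attributes missing in either record, renormalize the rest."""
    raw = [w if (x and y) else 0.0 for w, x, y in zip(weights, a, b)]
    total = sum(raw)
    return [r / total for r in raw] if total else raw


def similarity(a, b, weights):
    """Weighted field similarity computed with the effective weights."""
    eff = effective_weights(a, b, weights)
    return sum(
        w * SequenceMatcher(None, x, y).ratio()
        for w, x, y in zip(eff, a, b)
        if w > 0
    )


def detect_duplicates(records, weights, window=3, min_len_ratio=0.6, threshold=0.85):
    """Sorted-neighborhood pass with a length filter before each full comparison.

    Returns the detected index pairs and the number of full comparisons performed.
    """
    order = sorted(range(len(records)), key=lambda i: records[i][0])
    pairs, compared = [], 0
    for pos, i in enumerate(order):
        # Each record is paired only with the next (window - 1) sorted records.
        for j in order[pos + 1: pos + window]:
            if length_ratio(records[i], records[j]) < min_len_ratio:
                continue  # lengths too different: skip the costly comparison
            compared += 1
            if similarity(records[i], records[j], weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs, compared


if __name__ == "__main__":
    records = [
        ("John Smith", "12 Oak Street", "555-1234"),
        ("John Smith", "12 Oak Street", ""),  # phone number missing
        ("Zed", "1 A", ""),
        ("Alice Wong", "9 Pine Road", "555-9876"),
    ]
    print(detect_duplicates(records, weights=[0.5, 0.3, 0.2]))
```

In this toy data the length filter skips the two window pairs involving the very short record, and the renormalized weights let records 0 and 1 match even though one lacks a phone number. With the fixed weights alone, their score would be 0.8, below the assumed threshold of 0.85.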