计算机工程
COMPUTER ENGINEERING
2014
Issue 5
pp. 31-35, 40
6 pages in total
multiple data sources; duplicate master data; credibility model; detection algorithm; data credibility
Duplicate master data originating from multiple business systems degrades master data quality and hampers master data synchronization and master data mining. To address this, this paper proposes fastCdrDetection (Fast Cluster Duplicate Records Detection), a duplicate master data detection algorithm. Starting from the notion of data credibility, a master data credibility model is built on data source credibility, last update time, and data length, and a credible record generation algorithm is implemented on top of it. A non-recursive string similarity algorithm, FiledMatch, is designed to handle duplicate master data caused by Chinese shorthand, abbreviations, and misspellings. A sourceKeys algorithm preprocesses duplicate records that come from the same business system and share the same business primary key, which improves the efficiency of duplicate detection. Experiments are carried out on more than 630,000 existing supplier records for the infrastructure materials of a power grid and more than 230,000 simulated records. The results show that, compared with the PQS algorithm, fastCdrDetection raises the recall rate from 74% to 88% and the accuracy from 61% to 95%, which verifies the effectiveness of the algorithm.
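The abstract names three building blocks but gives no formulas for them, so the following is only a minimal Python sketch of the ideas, not the paper's implementation. Everything in it is an assumption made for illustration: the record fields (`source`, `key`, `name`, `updated`), the function names, the weights `(0.5, 0.3, 0.2)`, the linear weighted sum used as a credibility score, and the use of an iterative Levenshtein distance as a stand-in for FiledMatch, whose actual definition is not given here.

```python
from collections import defaultdict
from datetime import datetime


def similarity(a: str, b: str) -> float:
    """Non-recursive string similarity in [0, 1].

    Stand-in only: iterative Levenshtein distance normalised by the
    longer string. This is NOT the paper's FiledMatch algorithm.
    """
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def group_by_source_key(records):
    """Bucket records that share a source system and business key.

    Hypothetical analogue of the sourceKeys preprocessing step: records
    in one bucket are duplicates by construction and need no string
    comparison.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["source"], rec["key"])].append(rec)
    return groups


def credibility(rec, source_trust, oldest, newest, max_len,
                weights=(0.5, 0.3, 0.2)):
    """Assumed score: weighted sum of source trust, recency and length."""
    w_src, w_time, w_len = weights
    span = (newest - oldest).total_seconds()
    recency = 1.0 if span == 0 else (rec["updated"] - oldest).total_seconds() / span
    length = len(rec["name"]) / max_len if max_len else 0.0
    return w_src * source_trust[rec["source"]] + w_time * recency + w_len * length


def most_credible(records, source_trust):
    """Pick the record with the highest assumed credibility score."""
    oldest = min(r["updated"] for r in records)
    newest = max(r["updated"] for r in records)
    max_len = max(len(r["name"]) for r in records)
    return max(records,
               key=lambda r: credibility(r, source_trust, oldest, newest, max_len))


# Hypothetical supplier records from two made-up source systems.
r1 = {"source": "ERP", "key": "S001",
      "name": "Acme Power Equipment Co., Ltd.", "updated": datetime(2013, 6, 1)}
r2 = {"source": "ERP", "key": "S001",
      "name": "Acme Power Equip.", "updated": datetime(2012, 1, 1)}
r3 = {"source": "CRM", "key": "77",
      "name": "Acme Power Equipment Co Ltd", "updated": datetime(2013, 1, 1)}
trust = {"ERP": 0.9, "CRM": 0.6}

groups = group_by_source_key([r1, r2, r3])
best = most_credible([r1, r2, r3], trust)
```

In this sketch, grouping by source and key first means the pairwise similarity check only has to run across buckets, which is one plausible reading of why such a preprocessing step would improve detection efficiency.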