浙江大学学报(工学版)
JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE)
2015
Issue 2
pp. 303-308
6 pages
data integration; quality of data sources; truth finder
Existing quality evaluation models for conflicting data sources consider only accuracy and precision, ignoring the fact that false values and empty values affect the quality of a data source differently. In this paper, false descriptions provided by a data source are defined as initiative errors, and the absence of a description for an entity is defined as a passive error. A new quality model for data sources is built on these two kinds of error, with sensitivity and specificity replacing accuracy and precision, respectively. To handle the multi-truth problem, the descriptions that data sources give for an entity are merged in advance, and an inclusion relation between merged descriptions is defined together with a model for computing inclusion degrees. On the basis of this model, an algorithm named TFDQ is proposed for evaluating the quality of conflicting data sources from the inclusion degree of descriptions. Experiments on the widely used Books-Authors data set show that the results of TFDQ are closer to the ground truth than those of the Vote and TruthFinder algorithms.
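The ideas named in the abstract can be illustrated with a small Python sketch. It is not the TFDQ algorithm, whose formulas the abstract does not give. It shows only a plain majority-vote baseline, a count of the two error kinds against a known truth, and one possible inclusion measure. The function names, the overlap-ratio definition of inclusion degree, and the toy book and author labels are all assumptions made for illustration.

```python
from collections import Counter


def vote(claims):
    """Majority-vote baseline: per entity, keep the description given by most sources."""
    ballots = {}
    for described in claims.values():
        for entity, authors in described.items():
            if authors:  # empty descriptions cast no vote
                ballots.setdefault(entity, Counter())[frozenset(authors)] += 1
    return {entity: set(c.most_common(1)[0][0]) for entity, c in ballots.items()}


def inclusion_degree(a, b):
    """Assumed measure (not from the paper): fraction of description b contained in a."""
    a, b = set(a), set(b)
    return len(a & b) / len(b) if b else 1.0


def error_counts(described, truth):
    """Count wrong descriptions (active) and missing descriptions (passive) for one source."""
    active = passive = 0
    for entity, true_authors in truth.items():
        given = described.get(entity)
        if not given:
            passive += 1  # the source says nothing about this entity
        elif set(given) != set(true_authors):
            active += 1   # the source says something that is not the truth
    return active, passive


# Hypothetical toy data: three sources describing the authors of two books.
claims = {
    "s1": {"book1": {"author_a", "author_b"}, "book2": {"author_c"}},
    "s2": {"book1": {"author_a", "author_b"}},  # no description for book2
    "s3": {"book1": {"author_a"}, "book2": {"author_c"}},
}
truth = {"book1": {"author_a", "author_b"}, "book2": {"author_c"}}

for name, described in claims.items():
    print(name, error_counts(described, truth))
print(vote(claims))
```

In this toy data, source s2 makes one passive error and source s3 makes one active error. A model that scores both the same way cannot tell them apart, which is the distinction the abstract argues for. The description from s3 for book1 is also partly right: it is fully included in the true author set, while covering only half of it.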