计算机研究与发展
Journal of Computer Research and Development
2015
Issue 9
pp. 1931-1940
10 pages in total
truth discovery%data conflict%category-specific credibility of data sources%information quality%data fusion
The spread of the Internet and the growth of e-commerce have changed how people obtain information and how they consume. The Web has become an important information source for most people. At the same time, the quality of information on the Internet has become an increasingly prominent problem: the Web contains a large amount of outdated, incorrect, false, and biased information. In particular, different websites often provide conflicting information about the same object. Finding the correct information among these conflicting claims is a pressing problem, known as truth discovery. A survey of existing truth discovery methods shows that none of them considers how differences in a data source's credibility across data categories affect truth discovery. This paper therefore poses the problem of truth discovery based on the category-specific credibility of data sources. Two methods are proposed to detect differences in a source's credibility across data categories, and a Bayesian method is used to iteratively compute the category-specific credibility of each data source and the accuracy of attribute values. In addition, the coverage of data sources and the difficulty of each object are taken into account to further improve the accuracy of the truth discovery algorithm. Experiments on a real data set show that the proposed methods significantly improve the accuracy of truth discovery.
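The idea of iterating between per-category source credibility and value accuracy can be illustrated with a minimal sketch. This is not the paper's Bayesian formulation, which the abstract does not spell out; it swaps in a simpler trust-weighted vote, and it ignores source coverage and object difficulty. The function name `discover_truth`, the `prior` smoothing constant, and the toy claims are all assumptions made for illustration.

```python
from collections import defaultdict


def discover_truth(claims, categories, iterations=20, prior=0.8):
    """Trust-weighted voting with one trust score per (source, category).

    claims: list of (source, obj, value) tuples.
    categories: dict mapping each obj to its data category.
    Returns (truths, trust), where trust is keyed by (source, category).
    NOTE: an illustrative simplification, not the paper's Bayesian update.
    """
    # Every (source, category) pair starts at the same assumed prior trust.
    trust = defaultdict(lambda: prior)
    truths = {}
    for _ in range(iterations):
        # Step 1: score each candidate value by the summed trust of the
        # sources that claim it, using the trust for the object's category.
        scores = defaultdict(lambda: defaultdict(float))
        for source, obj, value in claims:
            scores[obj][value] += trust[(source, categories[obj])]
        truths = {obj: max(vals, key=vals.get) for obj, vals in scores.items()}

        # Step 2: re-estimate trust as the smoothed share of a source's
        # claims in a category that agree with the current truths.
        hits = defaultdict(float)
        total = defaultdict(float)
        for source, obj, value in claims:
            key = (source, categories[obj])
            total[key] += 1
            hits[key] += value == truths[obj]
        trust = defaultdict(
            lambda: prior,
            {k: (hits[k] + prior) / (total[k] + 1) for k in total},
        )
    return truths, dict(trust)


# Toy data (hypothetical): source A is right about books and wrong about
# movies, source B is the opposite, and C is silent on b3 and m3.
categories = {"b1": "book", "b2": "book", "b3": "book",
              "m1": "movie", "m2": "movie", "m3": "movie"}
claims = [
    ("A", "b1", "x"), ("B", "b1", "y"), ("C", "b1", "x"),
    ("A", "b2", "x"), ("B", "b2", "y"), ("C", "b2", "x"),
    ("B", "b3", "y"), ("A", "b3", "x"),
    ("A", "m1", "p"), ("B", "m1", "q"), ("C", "m1", "q"),
    ("A", "m2", "p"), ("B", "m2", "q"), ("C", "m2", "q"),
    ("A", "m3", "p"), ("B", "m3", "q"),
]
truths, trust = discover_truth(claims, categories)
print(truths)
print(trust)
```

In the toy data, A and B disagree on `b3` and `m3` with no third vote. A single trust score per source would leave both at a tie, since each source is right on half of the other objects. Keeping a separate score per category lets A's record on books settle `b3` and B's record on movies settle `m3`.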