CAJ | Academic paper

With the rapid growth in the number of Web databases and the volume of data they contain, Deep Web data integration has attracted increasing research attention. However, because most information on the Web is semi-structured or unstructured, extraction results contain considerable uncertainty, such as noisy data, repeated characters, and a mix of abbreviations and full names. This makes duplicate records hard to identify, and traditional deduplication algorithms do not perform well. To address this, a duplicate record identification model for Deep Web result integration is proposed. The model uses an improved edit-distance-based algorithm for string matching, then identifies duplicate records by constructing an attribute matching graph and applying a two-pass confirmation mechanism. Applying the model improves identification efficiency while preserving identification precision, and experiments demonstrate the feasibility of the proposed algorithm and model.
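The abstract does not spell out the improved algorithm, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes the classic Levenshtein edit distance (not the improved variant), a length-normalized similarity score, and a simple two-stage check standing in for the two-pass confirmation: a filter on one key attribute, then an average over all shared attributes. The attribute matching graph is not modeled. The names `similarity`, `is_duplicate`, `find_duplicates`, the `title` key, and both thresholds are hypothetical choices made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_duplicate(r1: dict, r2: dict, key: str = "title",
                 first: float = 0.8, second: float = 0.7) -> bool:
    """Two-stage check (assumed thresholds, chosen only for illustration)."""
    # Stage 1: cheap filter on a single key attribute.
    if similarity(r1[key], r2[key]) < first:
        return False
    # Stage 2: confirm using the mean similarity over all shared attributes.
    shared = [k for k in r1 if k in r2]
    mean = sum(similarity(r1[k], r2[k]) for k in shared) / len(shared)
    return mean >= second


def find_duplicates(records: list) -> list:
    """Return index pairs (i, j), i < j, of records judged to be duplicates."""
    return [(i, j)
            for (i, r1), (j, r2) in combinations(enumerate(records), 2)
            if is_duplicate(r1, r2)]


records = [
    {"title": "Deep Web Data Integration", "author": "Wang Li"},
    {"title": "Deep Web data integration.", "author": "Wang, Li"},
    {"title": "Query Optimization", "author": "Chen Hao"},
]
duplicate_pairs = find_duplicates(records)
```

In this toy input the first two records differ only in case and punctuation, so they pass both stages, while the third is rejected at the first stage. Comparing every pair is quadratic in the number of records; a real system would likely need blocking or indexing before the pairwise step.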