华南理工大学学报(自然科学版)
Journal of South China University of Technology (Natural Science Edition)
2014
Issue 7
pp. 28-32 (5 pages)
XML; data integration; text processing; data source-sensitivity
When preprocessed XML data are treated as text and processed with the term frequency-inverse document frequency (TF-IDF) model, the inverse document frequency (IDF) has shortcomings as a term weight. To address this, the paper defines the data source-sensitivity of a term as a correction coefficient for the IDF. Its value depends on the probability that the data providing the term come from different data sources: the higher the probability, the larger the value, and vice versa. A similarity function is then defined on the basis of the corrected term-weight vectors. Finally, experiments on detecting duplicate XML data from multiple sources are conducted on simulated and real datasets. The results show that the proposed method achieves a higher F-measure, which indicates that taking the data source-sensitivity of terms into account improves the effectiveness of the similarity function.
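The abstract does not give the formulas, so the following is only a minimal sketch of the general idea, not the authors' definition. It assumes that (a) the source-sensitivity of a term is the probability that two distinct records containing it come from different sources, (b) the coefficient enters the weight as a factor `(1 + sensitivity)` on a smoothed IDF, and (c) similarity is the cosine of the weighted vectors. All function names and the toy records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def source_sensitivity(records):
    """Map each term to the probability that two distinct records
    containing it come from different sources (an assumed definition).

    records: list of (source_id, tokens) pairs.
    """
    sources_per_term = defaultdict(Counter)
    for source, tokens in records:
        for term in set(tokens):
            sources_per_term[term][source] += 1

    sensitivity = {}
    for term, counts in sources_per_term.items():
        n = sum(counts.values())
        if n < 2:
            # A term seen in a single record cannot span two sources.
            sensitivity[term] = 0.0
            continue
        same_source_pairs = sum(c * (c - 1) for c in counts.values())
        sensitivity[term] = 1.0 - same_source_pairs / (n * (n - 1))
    return sensitivity


def weight_vectors(records):
    """Build one sparse vector per record with weights
    tf * smoothed_idf * (1 + sensitivity); the combination is assumed."""
    n_docs = len(records)
    df = Counter(term for _, tokens in records for term in set(tokens))
    sens = source_sensitivity(records)
    vectors = []
    for _, tokens in records:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * math.log(1 + n_docs / df[term])
            * (1 + sens[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[term] for term, w in u.items() if term in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Toy records: (source_id, tokens extracted from a flattened XML element).
records = [
    ("A", ["xml", "data", "integration"]),
    ("B", ["xml", "data", "integration"]),
    ("A", ["text", "processing"]),
    ("A", ["text", "mining"]),
]
vectors = weight_vectors(records)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy set, "xml" occurs in one record from each source and so gets the maximum sensitivity, while "text" occurs only in source A and gets none. The first two records are therefore the pair a duplicate detector would flag, given a suitable threshold on the similarity score.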