计算机工程与应用
Computer Engineering and Applications
2015
Issue 19
152-157
6 pages in total
K-means algorithm%dynamic clustering%feature selection%information entropy
Based on the structural characteristics of scientific literature, a four-layer mining model is constructed and a text feature selection method for scientific literature classification is proposed. The method first divides a document into four layers according to its structure, then applies K-means clustering to extract feature words layer by layer from the first three layers, and finally uses the Apriori algorithm to find the maximal frequent itemsets of the fourth layer, which serve as that layer's feature word set. Because K-means is sensitive to the choice of initial centers, the method first weights the clustering objects by information entropy to correct the inter-object distance function, and then uses the weighting function values of the initial clustering to select more suitable initial cluster centers. In addition, a standard value is set for the termination condition of K-means to reduce the number of iterations and hence the learning time, and redundant information produced by dynamic changes in the data is deleted to reduce interference during dynamic clustering, so that clustering becomes more accurate and more efficient. These measures allow the method to find feature words in a literature corpus more accurately than previous methods, and it is particularly suited to scientific literature. Experimental results show that, when the data volume is large, the method combined with the improved K-means algorithm achieves high performance in scientific literature classification.
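The abstract does not give the paper's formulas, so the following is only a rough sketch of the general idea of an entropy-weighted K-means with a threshold-based stopping rule. It rests on several assumptions that are not taken from the paper: weights are computed per feature with the standard entropy weight method, input values are non-negative (for example term frequencies), initial centers are chosen with a simple farthest-first heuristic as a stand-in for the paper's weighting-function-based selection, and the "standard value" for termination is modeled as a tolerance `tol` on total center movement. All function and parameter names are hypothetical.

```python
import math

def entropy_weights(rows):
    """Entropy weight method (assumed): features that vary more across rows get larger weights."""
    n, d = len(rows), len(rows[0])
    raw = []
    for j in range(d):
        col = [r[j] for r in rows]
        total = sum(col)
        h = 1.0  # an all-zero column is treated as carrying no information
        if total > 0:
            h = 0.0
            for v in col:
                p = v / total
                if p > 0:
                    h -= p * math.log(p)
            h /= math.log(n)  # normalize entropy to [0, 1]; requires n > 1
        raw.append(max(0.0, 1.0 - h))  # clamp tiny negative rounding errors
    s = sum(raw)
    return [x / s for x in raw] if s > 0 else [1.0 / d] * d

def wdist(a, b, w):
    """Euclidean distance with per-feature weights."""
    return math.sqrt(sum(wj * (x - y) ** 2 for x, y, wj in zip(a, b, w)))

def init_centers(rows, k, w):
    """Farthest-first initialization (a stand-in, not the paper's selection rule)."""
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(wdist(r, c, w) for c in centers))
        centers.append(list(far))
    return centers

def weighted_kmeans(rows, k, tol=1e-4, max_iter=100):
    """K-means under the weighted distance; stops once centers move less than tol in total."""
    w = entropy_weights(rows)
    centers = init_centers(rows, k, w)
    labels = [0] * len(rows)
    for _ in range(max_iter):
        # Assignment step: nearest center under the weighted distance.
        labels = [min(range(k), key=lambda c: wdist(r, centers[c], w)) for r in rows]
        # Update step: mean of each cluster; an empty cluster keeps its old center.
        new_centers = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])
        shift = sum(wdist(a, b, w) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # threshold-based termination
            break
    return labels, centers, w

if __name__ == "__main__":
    # Toy term-frequency rows; the second feature is constant, so its weight is about zero.
    docs = [[5, 1, 0], [4, 1, 1], [5, 1, 0], [0, 1, 6], [1, 1, 5], [0, 1, 6]]
    labels, centers, weights = weighted_kmeans(docs, k=2)
    print(labels, weights)
```

A larger `tol` trades some accuracy for fewer iterations, which is the kind of trade-off the abstract describes when it sets a standard value for the termination condition.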