陕西师范大学学报(自然科学版)
陝西師範大學學報(自然科學版)
협서사범대학학보(자연과학판)
Journal of Shaanxi Normal University (Natural Science Edition)
2015年
6期
1-8
,共8页
F统计量%内部评价指标%类簇数%K-me doids聚类算法%最小生成树
F統計量%內部評價指標%類簇數%K-me doids聚類算法%最小生成樹
F통계량%내부평개지표%류족수%K-me doids취류산법%최소생성수
F-statistics%internal evaluation criterion%the number of clusters%K-me doids cluste-ring algorithm%minimum spanning tree
用于发现数据集类簇数 k 的常用内部评价指标 DB(Davies Bouldin)和 BWP(Between-within Proportion)等需要先确定一个搜索范围 kmax ,使数据集的类簇数满足 k≤kmax,但如何确定 k max尚无理论指导。针对这一问题,提出一个新 F 统计量 Fr ,将 Fr 作为新聚类有效性准则,以判断聚类算法收敛与否,自适应地确定数据集类簇数;将 Fr 应用于快速 K-medoids 算法的收敛性判断,并以基于最小生成树的测地距离,即样本对在最小生成树上的路径长度,代替其间的直接欧氏距离度量样本相似性,得到一种自适应的快速 K-medoids 聚类算法,解决了 K-medoids 算法需要人为给定类簇数和不能发现任意形状簇的问题。UCI 机器学习数据库数据集和人工模拟数据集实验测试表明,本文提出的Fr 指标是一种有效的聚类算法评价指标,基于该指标和测地距离的 K-medoids 算法不仅能发现任意形状的簇,还可以自适应地确定数据集的类簇数,且对噪音数据有很好的鲁棒性。
用于髮現數據集類簇數 k 的常用內部評價指標 DB(Davies Bouldin)和 BWP(Between-within Proportion)等需要先確定一箇搜索範圍 kmax ,使數據集的類簇數滿足 k≤kmax,但如何確定 k max尚無理論指導。針對這一問題,提齣一箇新 F 統計量 Fr ,將 Fr 作為新聚類有效性準則,以判斷聚類算法收斂與否,自適應地確定數據集類簇數;將 Fr 應用于快速 K-medoids 算法的收斂性判斷,併以基于最小生成樹的測地距離,即樣本對在最小生成樹上的路徑長度,代替其間的直接歐氏距離度量樣本相似性,得到一種自適應的快速 K-medoids 聚類算法,解決瞭 K-medoids 算法需要人為給定類簇數和不能髮現任意形狀簇的問題。UCI 機器學習數據庫數據集和人工模擬數據集實驗測試錶明,本文提齣的Fr 指標是一種有效的聚類算法評價指標,基于該指標和測地距離的 K-medoids 算法不僅能髮現任意形狀的簇,還可以自適應地確定數據集的類簇數,且對譟音數據有很好的魯棒性。
용우발현수거집류족수 k 적상용내부평개지표 DB(Davies Bouldin)화 BWP(Between-within Proportion)등수요선학정일개수색범위 kmax ,사수거집적류족수만족 k≤kmax,단여하학정 k max상무이론지도。침대저일문제,제출일개신 F 통계량 Fr ,장 Fr 작위신취류유효성준칙,이판단취류산법수렴여부,자괄응지학정수거집류족수;장 Fr 응용우쾌속 K-medoids 산법적수렴성판단,병이기우최소생성수적측지거리,즉양본대재최소생성수상적로경장도,대체기간적직접구씨거리도량양본상사성,득도일충자괄응적쾌속 K-medoids 취류산법,해결료 K-medoids 산법수요인위급정류족수화불능발현임의형상족적문제。UCI 궤기학습수거고수거집화인공모의수거집실험측시표명,본문제출적Fr 지표시일충유효적취류산법평개지표,기우해지표화측지거리적 K-medoids 산법불부능발현임의형상적족,환가이자괄응지학정수거집적류족수,차대조음수거유흔호적로봉성。
The internal evaluation criteria,such as DB and BWP et al,need the upper bound of k max to be given in advance when these criteria are used to determine the number of clusters of a dataset,so that the number of clusters of k in a dataset satisfies k≤k max.However,there is not so far a theory about how to find the upper bound of k max .Aiming at this problem,a new F-sta-tistics Fr is proposed in this paper.Fr can be used as a new criterion to judge whether a clustering algorithm is converged or not,so that the number of clusters of a dataset can be found automati-cally without the upper bound of k max to be given in advance.In addition,F r is applied to the fast K-medoids clustering algorithm as its criterion to judge whether its iterations continues or not, and the geodesic distance between samples on the minimum spanning tree (MST),that is the length of the path between samples on MST,is introduced to instead of the direct Euclidean dis-tance between them to measure their similarities,so that an adaptive fast K-medoids algorithm is obtained.The adaptive K-medoids algorithm avoids the deficiencies of the available K-medoids al-gorithms which need the number of clusters to be given in advance and cannot detect the clusters with any arbitrary shape.The power of the proposed criterion Fr and the power of the adaptive fast K-medoids algorithm are tested in real world datasets from UCI machine learning repository and in the synthetic datasets.All the experimental results demonstrate that our proposed criteri-on Fr is a valid clustering criterion,and the adaptive fast K-medoids algorithm based on the F r and geodesic distance can recognize the clusters with arbitrary shape,and can detect the number of clusters automatically,and is robust to noises as well.