软件学报
JOURNAL OF SOFTWARE
2014
Issue 5
pp. 997-1013
17 pages
何萍%徐晓华%陆林%陈崚
semi-supervised clustering%pairwise constraint%random walk%component%influence expansion
Semi-supervised clustering aims to partition data points into clusters according to user-specified must-link and cannot-link constraints, yielding results that are more accurate and closer to what the user requires. Most current semi-supervised clustering algorithms either modify an existing clustering method or incorporate metric learning so that the clustering result is as consistent with the pairwise constraints as possible. However, few of them explicitly compute the degree of influence that each pairwise constraint exerts on the surrounding unconstrained data points. This paper proposes a semi-supervised clustering algorithm based on a two-level random walk, composed of a lower-level random walk on vertices and a higher-level random walk on components. The lower-level random walk computes the range and degree of influence that each selected constrained vertex has on the other vertices; this information is encapsulated in an intermediate structure called a "component". The higher-level random walk then propagates each pairwise constraint over the components with adaptive strength and integrates the influence of all constraints on every vertex into a cluster indicator matrix. Experiments on UCI data sets and large real-world data sets show that the proposed method is more accurate than other semi-supervised clustering algorithms and is also reasonably efficient.
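The abstract gives no formulas, so the following is only an illustrative sketch of the general idea of measuring how far a constrained vertex's influence reaches on a graph. It uses a plain single-level random walk with restart, which is a substitute technique and not the two-level algorithm proposed in the paper. The function names, the `restart` parameter, and the toy graph are all hypothetical.

```python
def row_normalize(weights):
    """Turn a non-negative weight matrix into a row-stochastic transition matrix."""
    n = len(weights)
    transition = []
    for i, row in enumerate(weights):
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            # Assumption for this sketch: an isolated vertex keeps the walker in place.
            transition.append([1.0 if j == i else 0.0 for j in range(n)])
    return transition


def influence_from(weights, seed, restart=0.2, iterations=200):
    """Random walk with restart from `seed`; returns one visit probability per vertex."""
    n = len(weights)
    transition = row_normalize(weights)
    scores = [1.0 if i == seed else 0.0 for i in range(n)]
    for _ in range(iterations):
        # One walk step: push the current probability mass along the edges.
        spread = [sum(scores[i] * transition[i][j] for i in range(n)) for j in range(n)]
        # With probability `restart` the walker jumps back to the seed vertex.
        scores = [(1.0 - restart) * spread[j] + (restart if j == seed else 0.0)
                  for j in range(n)]
    return scores


# Toy graph (hypothetical): two triangles {0, 1, 2} and {3, 4, 5} joined by a weak edge 2-3.
weights = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]

# Treat vertices 0 and 5 as a cannot-link pair and compare their influence on each vertex.
influence_a = influence_from(weights, seed=0)
influence_b = influence_from(weights, seed=5)
labels = [0 if a >= b else 1 for a, b in zip(influence_a, influence_b)]
print(labels)
```

In this toy case each vertex is assigned to the seed in its own triangle, because little probability mass crosses the weak bridge. The paper's method additionally groups influence into components and propagates constraints between them with adaptive strength, which this sketch does not attempt.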