软件学报
軟件學報
연건학보
JOURNAL OF SOFTWARE
2014年
9期
2037-2049
,共13页
大数据%谱聚类%特征分解%Nystr?m扩展%自适应采样
大數據%譜聚類%特徵分解%Nystr?m擴展%自適應採樣
대수거%보취류%특정분해%Nystr?m확전%자괄응채양
big data%spectral clustering%eigen-decomposition%Nystr?m extension%adaptive sampling
面对结构复杂的数据集,谱聚类是一种灵活而有效的聚类方法,它基于谱图理论,通过将数据点映射到一个由特征向量构成的低维空间,优化数据的结构,得到令人满意的聚类结果.但在谱聚类的过程中,特征分解的计算复杂度通常为O(n3),限制了谱聚类算法在大数据中的应用.Nystr?m扩展方法利用数据集中的部分抽样点,进行近似计算,逼近真实的特征空间,可以有效降低计算复杂度,为大数据谱聚类算法提供了新思路.抽样策略的选择对Nystr?m 扩展技术至关重要,设计了一种自适应的 Nystr?m 采样方法,每个数据点的抽样概率都会在一次采样完成后及时更新,而且从理论上证明了抽样误差会随着采样次数的增加呈指数下降.基于自适应的Nystr?m采样方法,提出一种适用于大数据的谱聚类算法,并对该算法的可行性和有效性进行了实验验证.
麵對結構複雜的數據集,譜聚類是一種靈活而有效的聚類方法,它基于譜圖理論,通過將數據點映射到一箇由特徵嚮量構成的低維空間,優化數據的結構,得到令人滿意的聚類結果.但在譜聚類的過程中,特徵分解的計算複雜度通常為O(n3),限製瞭譜聚類算法在大數據中的應用.Nystr?m擴展方法利用數據集中的部分抽樣點,進行近似計算,逼近真實的特徵空間,可以有效降低計算複雜度,為大數據譜聚類算法提供瞭新思路.抽樣策略的選擇對Nystr?m 擴展技術至關重要,設計瞭一種自適應的 Nystr?m 採樣方法,每箇數據點的抽樣概率都會在一次採樣完成後及時更新,而且從理論上證明瞭抽樣誤差會隨著採樣次數的增加呈指數下降.基于自適應的Nystr?m採樣方法,提齣一種適用于大數據的譜聚類算法,併對該算法的可行性和有效性進行瞭實驗驗證.
면대결구복잡적수거집,보취류시일충령활이유효적취류방법,타기우보도이론,통과장수거점영사도일개유특정향량구성적저유공간,우화수거적결구,득도령인만의적취류결과.단재보취류적과정중,특정분해적계산복잡도통상위O(n3),한제료보취류산법재대수거중적응용.Nystr?m확전방법이용수거집중적부분추양점,진행근사계산,핍근진실적특정공간,가이유효강저계산복잡도,위대수거보취류산법제공료신사로.추양책략적선택대Nystr?m 확전기술지관중요,설계료일충자괄응적 Nystr?m 채양방법,매개수거점적추양개솔도회재일차채양완성후급시경신,이차종이론상증명료추양오차회수착채양차수적증가정지수하강.기우자괄응적Nystr?m채양방법,제출일충괄용우대수거적보취류산법,병대해산법적가행성화유효성진행료실험험증.
Spectral clustering is a flexible and effective clustering method for complex structure data sets. It is based on spectral graph theory and can produce satisfactory clustering results by mapping the data points into a low-dimensional space constituted by eigenvectors so that the data structure is optimized. But in the process of spectral clustering, the computational complexity of eigen-decomposition is usually O(n3), which limits the application of spectral clustering algorithm in big data problems. Nystr?m extension method uses partial points sampled from the data set and approximate calculation to simulate the real eigenspace. In this way, the computational complexity can be effectively reduced, which provides a new idea for big data spectral clustering algorithm. The selection of sampling strategy is essential for Nystr?m extension technology. In this paper, the design of an adaptive Nystr?m sampling method is presented. The sampling probability of every data point will be updated after each sampling pass, and a proof is given that the sampling error will decrease exponentially with the increase of sample times. Based on the adaptive Nystr?m sampling method, a spectral clustering algorithm for big data analysis is presented, and its feasibility and effectiveness is verified by experiments.