CAJ | Academic paper

随着信息时代的来临，互联网产生的大规模高维数据呈现几何级数增长，对其进行谱聚类在计算时间和内存使用上都存在瓶颈问题，尤其是求Laplacian矩阵特征向量分解。鉴于Hadoop MapReduce并行编程模型对密集型数据处理的优势，基于t最近邻稀疏化近似相似Laplacian矩阵，设计Hadoop MapReduce并行近似谱聚类算法，以期解决上述瓶颈问题。实验使用UCI Bag of Words数据集验证所设计算法的正确性和有效性，结果显示该并行设计在谱聚类质量和性能方面达到了一定的预期效果。
With the advent of the information age, the large-scale, high-dimensional data generated on the Internet is growing exponentially. Spectral clustering of such data runs into bottlenecks in both computation time and memory use, particularly in the eigenvector decomposition of the Laplacian matrix. Given the strengths of the Hadoop MapReduce parallel programming model for data-intensive processing, this paper designs a parallel approximate spectral clustering algorithm on Hadoop MapReduce, based on an approximate similarity (Laplacian) matrix sparsified by t-nearest neighbours, with the aim of resolving these bottlenecks. Experiments on the UCI Bag of Words dataset validate the correctness and effectiveness of the designed algorithm; the results indicate that the parallel design achieves, to a certain extent, the expected results in terms of spectral clustering quality and performance.
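The t-nearest-neighbour sparsification mentioned in the abstract can be illustrated with a minimal single-process sketch written in a map/reduce style. This is not the paper's Hadoop implementation. The function names, the Gaussian similarity kernel with bandwidth sigma, and the rule of keeping an entry when either point lists the other among its t nearest neighbours are all assumptions made for illustration. The eigenvector decomposition and the final k-means step are omitted.

```python
import heapq
import math
from collections import defaultdict


def map_similarities(i, points, sigma=1.0):
    """Map step (hypothetical): emit (row, (col, similarity)) for point i.

    Assumes a Gaussian kernel; the abstract does not name the similarity.
    """
    xi = points[i]
    for j, xj in enumerate(points):
        if j != i:
            d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
            yield i, (j, math.exp(-d2 / (2.0 * sigma ** 2)))


def reduce_top_t(row, pairs, t):
    """Reduce step (hypothetical): keep the t largest similarities of a row."""
    return row, heapq.nlargest(t, pairs, key=lambda p: p[1])


def sparse_similarity(points, t, sigma=1.0):
    """Simulate map -> shuffle -> reduce in one process.

    Returns a symmetric sparse matrix as {(row, col): similarity}.
    """
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for i in range(len(points)):
        for row, pair in map_similarities(i, points, sigma):
            grouped[row].append(pair)

    sparse = {}
    for row, pairs in grouped.items():
        _, kept = reduce_top_t(row, pairs, t)
        for col, s in kept:
            # Assumed symmetrization: keep the entry in both directions.
            sparse[(row, col)] = s
            sparse[(col, row)] = s
    return sparse


def normalized_laplacian(sparse, n):
    """Build L = I - D^(-1/2) S D^(-1/2) from the sparse similarity S.

    Assumes every row has positive degree, which holds when t >= 1 and
    the kept similarities do not underflow to zero.
    """
    degree = [0.0] * n
    for (i, _), s in sparse.items():
        degree[i] += s
    lap = {(i, i): 1.0 for i in range(n)}
    for (i, j), s in sparse.items():
        lap[(i, j)] = -s / math.sqrt(degree[i] * degree[j])
    return lap
```

With n points and t neighbours per row, the sparse matrix holds at most 2nt off-diagonal entries instead of n(n-1), which is where the memory saving of the approximation comes from.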