Journal of Computer Science and Technology (English Edition)
2002, Issue 5, pp. 611-624 (14 pages)
Keywords: clustering; categorical data; data stream; data mining
This paper presents Squeezer, a new efficient algorithm for clustering categorical data that produces high-quality clustering results while scaling well. The Squeezer algorithm reads each tuple t in sequence and either assigns t to an existing cluster (initially there are none) or creates a new cluster from t; the choice is determined by the similarities between t and the existing clusters. Because it processes the data in a single sequential pass, the proposed algorithm is well suited to clustering data streams, where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence so far using a small amount of memory and time. Outliers can also be handled efficiently and directly in Squeezer. Experimental results on real-life and synthetic datasets verify the superiority of Squeezer.
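The single-pass procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not define the similarity measure or the assignment rule, so the `similarity` function (the sum, over attributes, of the fraction of cluster members sharing the tuple's value), the `threshold` parameter, and the names `squeezer_sketch` and `summary` are all assumptions made for this example.

```python
from collections import Counter


def similarity(summary, size, tuple_):
    """Assumed measure: for each attribute, the fraction of cluster members
    that share the tuple's value, summed over all attributes."""
    return sum(counts[value] / size for counts, value in zip(summary, tuple_))


def squeezer_sketch(tuples, threshold):
    """One sequential pass: assign each tuple to the most similar existing
    cluster, or start a new cluster if no similarity reaches `threshold`."""
    clusters = []  # each cluster: {"summary": [Counter per attribute], "size": int}
    labels = []
    for t in tuples:
        best_idx, best_sim = -1, float("-inf")
        for idx, c in enumerate(clusters):
            sim = similarity(c["summary"], c["size"], t)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx < 0 or best_sim < threshold:
            # No cluster is similar enough (or none exist yet): create one.
            clusters.append({"summary": [Counter() for _ in t], "size": 0})
            best_idx = len(clusters) - 1
        chosen = clusters[best_idx]
        # Update only the per-attribute value counts; the tuples themselves
        # are not stored, which keeps memory use small.
        for counts, value in zip(chosen["summary"], t):
            counts[value] += 1
        chosen["size"] += 1
        labels.append(best_idx)
    return labels
```

Calling `squeezer_sketch(rows, threshold=1.0)` on an iterable of equal-length tuples returns one cluster index per tuple, in input order. Because each cluster is represented only by value counts, the sketch never revisits earlier tuples, which is the property the abstract relies on for data streams. Under this assumed scheme, clusters that remain very small after the pass could be inspected as outlier candidates.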