中文信息学报
中文信息學報
중문신식학보
JOURNAL OF CHINESE INFORMATION PROCESSING
2010
Issue 3
81-88
8 pages
曾依灵%许洪波%吴高巍%程学旗%白硕
曾依靈%許洪波%吳高巍%程學旗%白碩
증의령%허홍파%오고외%정학기%백석
计算机应用%中文信息处理%文本聚类%空间映射%尺度变换%模型不匹配
計算機應用%中文信息處理%文本聚類%空間映射%尺度變換%模型不匹配
계산궤응용%중문신식처리%문본취류%공간영사%척도변환%모형불필배
computer application%Chinese information processing%document clustering%space mapping%rescaling%model misfit
传统聚类算法通常建立在显式的模型之上,很少考虑泛化模型以适应不同的数据,由此导致了模型不匹配问题.针对此问题,该文提出了一种基于空间映射(Mapping)及尺度变换(Rescaling)的聚类框架(简称M-R框架).具体而言,M-R框架首先将语料映射到一组具有良好区分度的方向所构建的坐标系中,以统计各个簇的分布特性,然后根据这些分布特性对各个坐标轴进行尺度变换,以归一化语料中各个类簇的分布.如上两步操作伴随算法迭代执行,直至算法收敛.该文将M-R框架应用到K-means算法及谱聚类算法上以验证其性能,在国际标准评测语料上的实验表明,应用了M-R框架的K-means及谱聚类在所有语料集上获得了全面的性能提升.
傳統聚類算法通常建立在顯式的模型之上,很少考慮泛化模型以適應不同的數據,由此導致了模型不匹配問題.針對此問題,該文提出了一種基於空間映射(Mapping)及尺度變換(Rescaling)的聚類框架(簡稱M-R框架).具體而言,M-R框架首先將語料映射到一組具有良好區分度的方向所構建的坐標系中,以統計各個簇的分佈特性,然後根據這些分佈特性對各個坐標軸進行尺度變換,以歸一化語料中各個類簇的分佈.如上兩步操作伴隨算法迭代執行,直至算法收斂.該文將M-R框架應用到K-means算法及譜聚類算法上以驗證其性能,在國際標準評測語料上的實驗表明,應用了M-R框架的K-means及譜聚類在所有語料集上獲得了全面的性能提升.
전통취류산법통상건립재현식적모형지상,흔소고필범화모형이괄응불동적수거,유차도치료모형불필배문제.침대차문제,해문제출료일충기우공간영사(Mapping)급척도변환(Rescaling)적취류광가(간칭M-R광가).구체이언,M-R광가수선장어료영사도일조구유량호구분도적방향소구건적좌표계중,이통계각개족적분포특성,연후근거저사분포특성대각개좌표축진행척도변환,이귀일화어료중각개류족적분포.여상량보조작반수산법질대집행,직지산법수렴.해문장M-R광가응용도K-means산법급보취류산법상이험증기성능,재국제표준평측어료상적실험표명,응용료M-R광가적K-means급보취류재소유어료집상획득료전면적성능제승.
Traditional clustering algorithms suffer from the model mismatch problem when the distribution of real data does not fit the model assumptions. To address this problem, a mapping and rescaling framework (referred to as the M-R framework) is proposed for document clustering. Specifically, documents are first mapped into a discriminative coordinate system so that the distribution statistics of each cluster can be analyzed along the corresponding dimension. With these statistics, a rescaling operation is then applied to each axis to normalize the data distribution according to the model assumptions. These two steps are performed iteratively along with the clustering algorithm until it converges. In the experiments, the M-R framework is applied to traditional k-means and to the state-of-the-art spectral clustering algorithm Ncut. Results on well-known datasets show that the M-R framework improves performance on all datasets.
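
The map-then-rescale loop described in the abstract might be sketched as follows. This is only an illustrative toy, not the authors' algorithm: the abstract does not say how the discriminative directions are chosen or how the axes are rescaled, so the sketch assumes (hypothetically) that the directions are the unit-normalized cluster centroids and that each axis is divided by the standard deviation of its own cluster's projections. The function name `mr_kmeans` and all helper names are invented for this example.

```python
import math
import random


def _mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _centers(vectors, labels, k, rng):
    """Mean of each cluster; an empty cluster is reseeded with a random vector."""
    out = []
    for j in range(k):
        members = [v for v, lab in zip(vectors, labels) if lab == j]
        out.append(_mean(members) if members else list(rng.choice(vectors)))
    return out


def mr_kmeans(docs, k, n_iter=20, seed=0):
    """Toy mapping-and-rescaling loop wrapped around k-means.

    Assumptions (not taken from the paper): directions are normalized
    centroids, and rescaling divides axis j by the standard deviation of
    cluster j's projections onto that axis.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in docs]
    for _ in range(n_iter):
        # Mapping: one axis per cluster, pointing along its centroid.
        directions = []
        for c in _centers(docs, labels, k, rng):
            norm = math.sqrt(_dot(c, c)) or 1.0
            directions.append([x / norm for x in c])
        mapped = [[_dot(d, u) for u in directions] for d in docs]

        # Rescaling: normalize the spread of each cluster along its own axis.
        scale = []
        for j in range(k):
            vals = [m[j] for m, lab in zip(mapped, labels) if lab == j]
            if len(vals) > 1:
                mu = sum(vals) / len(vals)
                std = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
            else:
                std = 1.0
            scale.append(max(std, 1e-6))
        rescaled = [[x / s for x, s in zip(m, scale)] for m in mapped]

        # Ordinary k-means reassignment, carried out in the rescaled space.
        centers = _centers(rescaled, labels, k, rng)
        new_labels = [
            min(range(k),
                key=lambda j: sum((x - y) ** 2 for x, y in zip(p, centers[j])))
            for p in rescaled
        ]
        if new_labels == labels:  # assignments stable: treat as converged
            break
        labels = new_labels
    return labels
```

Calling `mr_kmeans(vectors, k=2)` on a list of document vectors (for example, term-weight vectors) returns one cluster label per document. The same wrapper idea could in principle be placed around a spectral method such as Ncut, but that variant is not shown here.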