计算机科学与探索
Journal of Frontiers of Computer Science & Technology
2015, Issue 11, pp. 1301-1313 (13 pages)
clustering; time series; k-means algorithm
Time series clustering is an important technique for analyzing and forecasting how the search index of search-engine queries and the popularity of social-network topics change over time. However, existing time series clustering algorithms have two shortcomings. First, most of them work only on time series cut to equal length, which tends to discard many important feature points and degrades the clustering results. Second, similarity measures computed directly on the raw observed values cannot accurately capture the shape similarity of time series. To address these problems, this paper proposes a framework for clustering time series of unequal length. First, z-score standardization is applied to the observed values to remove the effect of differences in their order of magnitude. Next, the STS (short time series) distance is extended with a sliding window into a distance measure for time series of unequal length, together with a method for computing the center curves of a k-means-like clustering algorithm. Combining these yields a clustering algorithm based on the sliding-window STS distance for unequal-length time series. Extensive experiments on two real datasets collected from search engines and public data show that the proposed algorithm handles unequal-length time series, removes the effect of magnitude differences, and achieves better clustering results than existing algorithms.
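The preprocessing and distance steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption: the function names are hypothetical, observations are assumed to be evenly spaced (so the STS distance reduces to the Euclidean distance between first differences), and the sliding-window extension is assumed to slide the shorter series along the longer one and keep the smallest STS distance. The paper's actual definition and its center-curve computation may differ.

```python
import math


def z_score(series):
    """Standardize a series to zero mean and unit (population) standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        # A constant series has no shape; map it to all zeros.
        return [0.0] * n
    return [(x - mean) / std for x in series]


def sts_distance(a, b):
    """STS distance for two equal-length series with unit time spacing.

    Compares the slopes between consecutive observations rather than
    the observed values themselves.
    """
    if len(a) != len(b):
        raise ValueError("sts_distance expects series of equal length")
    return math.sqrt(
        sum(((a[k + 1] - a[k]) - (b[k + 1] - b[k])) ** 2 for k in range(len(a) - 1))
    )


def sliding_sts_distance(a, b):
    """Assumed sliding-window extension for series of unequal length.

    Slides the shorter series along the longer one and returns the
    minimum STS distance over all alignments.
    """
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    m = len(short)
    return min(
        sts_distance(short, long_[i:i + m]) for i in range(len(long_) - m + 1)
    )


# Two series with the same shape but different magnitude and length.
x = z_score([1, 2, 3, 2, 1])
y = z_score([10, 20, 30, 20, 10])
print(sliding_sts_distance(x, y))
print(sliding_sts_distance([0, 1, 0], [5, 5, 6, 5, 5]))
```

In this sketch the z-score step is what makes the first pair comparable despite the tenfold difference in magnitude, and the sliding window is what lets the three-point series be matched against the five-point one without truncating either.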