山东农业大学学报(自然科学版)
JOURNAL OF SHANDONG AGRICULTURAL UNIVERSITY (NATURAL SCIENCE)
2014
Issue 2
pp. 216-222 (7 pages)
SMS classification; k-nearest neighbor (kNN) algorithm; feature vector set; vector space model
This paper studies and implements a client-side SMS classification system for mobile phones based on the kNN algorithm. Two feature vector sets, one for normal SMS and one for spam SMS, are extracted from a self-built SMS corpus. Preprocessing, dimensionality reduction, and removal of low-frequency feature terms allow each set to retain as many of the characteristic terms of its class as possible. The corpus is divided into a comparison set and a test set. The study finds that classification works best when the comparison set contains n = 600 messages: a smaller n lowers the recognition rate, while a larger n increases the time complexity of classification. The best number of nearest neighbors is k = 25. The study also finds that results are best when the probability difference used in selecting the k messages is between 1% and 2% and the count difference used in deciding the message class is between 5 and 15. Following the principle of preserving the pass rate of normal SMS while raising the recognition rate of spam SMS, the final system parameters are n = 600, k = 25, a probability difference of 1.5%, and a count difference of 9, which give recognition rates of up to 97.3% for normal SMS and 89% for spam SMS.
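As a rough illustration of the kind of classifier the abstract describes, the following is a minimal sketch of kNN voting over a vector space model. It is not the authors' implementation. Several details are assumptions made for the example: whitespace tokenization, raw term-frequency vectors, cosine similarity, and the reading of the "count difference" as the margin by which spam neighbors must outnumber normal neighbors before a message is labeled spam. The abstract does not define the probability difference, so it is left out. The names `to_vector`, `cosine`, `knn_classify`, and `count_margin` are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Term-frequency vector for a message (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(message, comparison_set, k=25, count_margin=9):
    """Classify a message by voting among its k most similar comparison messages.

    comparison_set is a list of (text, label) pairs with label "normal" or "spam".
    Assumed rule: label as spam only when spam neighbors outnumber normal
    neighbors by at least count_margin; otherwise let the message through.
    """
    query = to_vector(message)
    scored = sorted(
        ((cosine(query, to_vector(text)), label) for text, label in comparison_set),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    if votes["spam"] - votes["normal"] >= count_margin:
        return "spam"
    return "normal"


# Toy comparison set, far smaller than the n = 600 reported in the abstract.
toy = [
    ("win free prize now", "spam"),
    ("free cash prize claim now", "spam"),
    ("claim your free prize", "spam"),
    ("are we meeting for lunch today", "normal"),
    ("see you at lunch", "normal"),
    ("call me when you get home", "normal"),
]

print(knn_classify("free prize waiting claim now", toy, k=3, count_margin=1))  # spam
print(knn_classify("lunch today at home", toy, k=3, count_margin=1))  # normal
```

Defaulting to "normal" whenever the margin is not met mirrors the stated principle of protecting the pass rate of normal messages; a larger `count_margin` trades spam recall for fewer false positives.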