模式识别与人工智能
Moshi Shibie yu Rengong Zhineng
2013
Issue 9
pp. 845-852
8 pages in total
吐尔地·托合提%艾克白尔·帕塔尔%艾斯卡尔·艾木都拉
维吾尔文切分%互信息%t-测试差%邻接对熵%无监督特征选择
Uyghur Segmentation%Mutual Information%Difference of t-Test%Entropy of Adjacency%Unsupervised Feature Selection
维吾尔文常用切分方法会产生大量的语义抽象甚至多义的词特征,因此学习算法难以发现高维数据中隐藏的结构.提出一种无监督切分方法dme-TS和一种无监督特征选择方法UMRMR-UFS. dme-TS从大规模生语料中自动获取单词Bi-gram及上下文语境信息,并将相邻单词间的t-测试差、互信息及双词上下文邻接对熵的线性融合作为一个组合统计量( dme)来评价单词间的结合能力,从而将文本切分成语义具体的独立语言单位的特征集合. UMRMR-UFS用一种综合考虑最大相关度和最小冗余的无监督特征选择标准( UMRMR)来评价每一个特征的重要性,并将最重要的特征依次移入到特征子集中.实验结果表明dme-TS能有效控制原始特征集的规模,提高特征项本身的质量,用UMRMR-UFS的输出来表征文本时,学习算法也表现出其最高的性能.
Commonly used Uyghur segmentation methods produce a large number of semantically abstract or even polysemous word features, which makes it difficult for learning algorithms to find the hidden structure in high-dimensional data. An unsupervised segmentation approach, dme-TS, and an unsupervised feature selection approach, UMRMR-UFS, are proposed. In dme-TS, word-based bigrams and contextual information are derived automatically from a large-scale raw text corpus, and a linear combination of the difference of t-test, mutual information, and the adjacency entropy of word pairs is taken as a combined statistic (dme) to estimate the association strength between two adjacent Uyghur words, so that text is segmented into a feature set of semantically concrete, independent linguistic units. In UMRMR-UFS, an unsupervised feature selection criterion (UMRMR) that jointly considers maximum relevance and minimum redundancy is used to estimate the importance of each feature, and the most important features are moved into the feature subset one by one. Experimental results show that dme-TS effectively controls the size of the original feature set and improves the quality of the features themselves, and that the learning algorithm achieves its best performance when texts are represented by the output of UMRMR-UFS.
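The linear fusion described for dme can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weights, the exact definitions of the three statistics, or the decision rule, so the function names (`pmi`, `dme`, `segment`), the default weights, and the threshold-based joining rule are all assumptions made for illustration. Only mutual information is computed from counts here; the difference of t-test and the adjacency entropy are passed in as precomputed values.

```python
import math
from collections import Counter


def pmi(bigram, unigrams, bigrams):
    """Pointwise mutual information (base 2) of an adjacent word pair (x, y)."""
    x, y = bigram
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = bigrams[bigram] / n_bi
    if p_xy == 0:
        return float("-inf")  # pair never observed in the corpus
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return math.log2(p_xy / (p_x * p_y))


def dme(dts, mi, entropy, weights=(1.0, 1.0, 1.0)):
    """Linear fusion of the three statistics (weights are hypothetical)."""
    a, b, c = weights
    return a * dts + b * mi + c * entropy


def segment(words, score, threshold):
    """Join adjacent words into one unit when score(prev, cur) >= threshold.

    `score` is any callable returning the combined statistic for a pair;
    the thresholding rule is an assumption, not taken from the paper.
    """
    if not words:
        return []
    units = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if score(prev, cur) >= threshold:
            units[-1].append(cur)   # strong association: extend current unit
        else:
            units.append([cur])     # weak association: start a new unit
    return [" ".join(u) for u in units]


if __name__ == "__main__":
    tokens = "a b a b".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    mi = pmi(("a", "b"), unigrams, bigrams)
    # dts and entropy would come from corpus statistics; 0.0 is a placeholder.
    print(dme(0.0, mi, 0.0))
    print(segment(["a", "b", "c"],
                  lambda x, y: 1.0 if (x, y) == ("a", "b") else 0.0,
                  0.5))
```

In practice the weights and threshold would need to be tuned on data, and `score` would wrap corpus-derived values of all three statistics for each adjacent word pair.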