CAJ | Academic paper

Commonly used Uyghur segmentation methods produce a large number of semantically abstract, and even polysemous, word features, which makes it difficult for learning algorithms to discover the structure hidden in high-dimensional data. This paper proposes an unsupervised segmentation method, dme-TS, and an unsupervised feature selection method, UMRMR-UFS. dme-TS automatically derives word-based bi-grams and their contextual information from a large-scale raw text corpus, and uses a linear combination of the difference of t-test, mutual information, and the entropy of the adjacency pairs surrounding a two-word unit as a combined statistic (dme) to estimate the association strength between two adjacent Uyghur words; the text is thereby segmented into a feature set of semantically concrete, independent linguistic units. UMRMR-UFS evaluates the importance of each feature with an unsupervised feature selection criterion (UMRMR) that jointly considers maximum relevance and minimum redundancy, and moves the most important features into the feature subset one at a time. Experimental results show that dme-TS effectively limits the size of the original feature set and improves the quality of the features themselves, and that the learning algorithm achieves its best performance when texts are represented by the output of UMRMR-UFS.
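The combined statistic can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption: the difference of t-test follows one formulation used in the unsupervised Chinese segmentation literature, mutual information is taken as pointwise mutual information, and the adjacency entropy is computed over the (left, right) neighbour pairs observed around a bigram. The function names (collect_stats, t_score, dme, segment), the equal weights, and the merge threshold are hypothetical, and the three components are not normalised to a common scale as a real system probably would.

```python
import math
from collections import Counter, defaultdict


def collect_stats(sentences):
    """Count unigrams, bigrams, and the (left, right) neighbours of each bigram."""
    uni, bi, ctx = Counter(), Counter(), defaultdict(Counter)
    for words in sentences:
        uni.update(words)
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            bi[pair] += 1
            left = words[i - 1] if i > 0 else "<s>"
            right = words[i + 2] if i + 2 < len(words) else "</s>"
            ctx[pair][(left, right)] += 1
    return uni, bi, ctx


def cond(a, b, uni, bi):
    """Return p(b | a) and the variance estimate count(a, b) / count(a)^2."""
    if uni[a] == 0:
        return 0.0, 0.0
    return bi[(a, b)] / uni[a], bi[(a, b)] / uni[a] ** 2


def t_score(left, mid, right, uni, bi):
    """Positive when `mid` binds more strongly to `right` than to `left`."""
    p_r, v_r = cond(mid, right, uni, bi)
    p_l, v_l = cond(left, mid, uni, bi)
    denom = math.sqrt(v_r + v_l)
    return (p_r - p_l) / denom if denom else 0.0


def dme(words, i, stats, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Score the boundary between words[i] and words[i + 1] (assumed formulas)."""
    uni, bi, ctx = stats
    x, y = words[i], words[i + 1]
    v = words[i - 1] if i > 0 else None
    w = words[i + 2] if i + 2 < len(words) else None
    # Difference of t-test: how far x leans right minus how far y leans right.
    dts = t_score(v, x, y, uni, bi) - t_score(x, y, w, uni, bi)
    # Pointwise mutual information; 0.0 for unseen pairs is an assumption.
    if bi[(x, y)]:
        n_uni, n_bi = sum(uni.values()), sum(bi.values())
        mi = math.log2((bi[(x, y)] / n_bi) / ((uni[x] / n_uni) * (uni[y] / n_uni)))
    else:
        mi = 0.0
    # Entropy of the (left, right) neighbour pairs seen around the bigram.
    seen = ctx.get((x, y), Counter())
    total = sum(seen.values())
    ent = -sum(n / total * math.log2(n / total) for n in seen.values()) if total else 0.0
    w_t, w_mi, w_ent = weights
    return w_t * dts + w_mi * mi + w_ent * ent


def segment(words, stats, threshold, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Merge adjacent words whose dme score exceeds the (hypothetical) threshold."""
    if not words:
        return []
    units, current = [], [words[0]]
    for i in range(len(words) - 1):
        if dme(words, i, stats, weights) > threshold:
            current.append(words[i + 1])
        else:
            units.append(" ".join(current))
            current = [words[i + 1]]
    units.append(" ".join(current))
    return units


# Toy English corpus standing in for a large Uyghur raw-text corpus.
corpus = [s.split() for s in ("new york is big", "i love new york",
                              "new york has parks", "parks are big")]
stats = collect_stats(corpus)
print(segment(corpus[0], stats, threshold=2.0))
```

On this toy corpus the pair that recurs in varied contexts scores higher than a pair seen only once, which is the behaviour the combined statistic is meant to capture. In practice the weights and threshold would have to be tuned on the target corpus.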