CAJ | Academic paper

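To address the difficulty of selecting an effective discriminative gene subset from high-dimensional, small-sample cancer gene datasets, a novel hybrid gene selection algorithm based on statistical correlation and K-means is proposed. The algorithm first uses the Pearson correlation coefficient and the Wilcoxon rank-sum test to compute the correlation between each gene and the class label and, following the statistical correlation principle, selects a number of genes that are strongly correlated with the class label to form the pre-selected gene subset. It then uses the K-means algorithm to group highly correlated genes in the pre-selected subset into the same cluster, trains an SVM classification model, and computes the weight of each gene. From each cluster it selects either the gene with the largest weight or, using the roulette wheel idea, the gene with the most votes as the representative gene of that cluster; the representative genes of all clusters form the effective discriminative gene subset. The algorithm is compared experimentally with Random, a random gene selection algorithm that picks each cluster's representative gene at random; with SVM-RFE, the classic gene selection algorithm of Guyon; and with SVM-SFS, a gene selection algorithm that uses a sequential forward search strategy. The averaged results of 200 repeated experiments on several classic gene datasets show that the proposed hybrid gene selection algorithm can select gene subsets with very good discriminative ability, and that classifiers built on these subsets achieve very good classification performance.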
Identifying the small number of discriminative genes that separate cancer patients from healthy people is difficult when a dataset contains only a few samples but tens of thousands of genes. To address this problem, this paper proposes novel hybrid gene selection algorithms based on statistical correlation and the K-means algorithm. The Pearson correlation coefficient and the Wilcoxon rank-sum test are each used to measure the importance of every gene to the classification; the least important genes are filtered out and about 10 percent of the genes, the most important ones, are kept as the pre-selected gene subset. The correlated genes in the pre-selected subset are then clustered with the K-means algorithm, and the weight of each gene is computed from the corresponding coefficients of an SVM classifier. In each cluster, the most important gene, that is, the one with the largest weight or the one with the most votes when the roulette wheel strategy is used, is chosen as the representative gene, and the representative genes together form the discriminative gene subset. To verify the effectiveness of the proposed hybrid algorithms, a random selection strategy (named Random) is also used to pick the representative gene of each cluster. The proposed algorithms are compared with Random, with the widely used gene selection algorithm SVM-RFE by Guyon, and with SVM-SFS, a gene selection algorithm based on sequential forward search. The results, averaged over 200 runs of these algorithms on several classic and widely used gene expression datasets, demonstrate that the proposed algorithms select gene subsets with very good discriminative ability, and that classifiers built on the selected subsets achieve very high classification accuracy.
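The pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it covers only the Pearson filter and the largest-weight selection rule (the Wilcoxon rank-sum filter and the roulette wheel voting variant are omitted), it assumes the SVM is linear and trained on the standardized pre-selected genes, and the function names `pearson_scores` and `select_genes`, the parameters `keep_ratio` and `n_clusters`, and their default values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC


def pearson_scores(X, y):
    """Absolute Pearson correlation between each gene (column of X) and the labels."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant genes (or labels) get a score of 0
    return np.abs(Xc.T @ yc) / denom


def select_genes(X, y, keep_ratio=0.1, n_clusters=5, seed=0):
    """Return the indices of one representative gene per cluster.

    X: (n_samples, n_genes) expression matrix; y: class labels.
    keep_ratio, n_clusters and seed are illustrative assumptions.
    """
    n_genes = X.shape[1]

    # Step 1: filter. Keep the genes most correlated with the class label.
    scores = pearson_scores(X, y)
    n_keep = min(n_genes, max(n_clusters, int(round(keep_ratio * n_genes))))
    pre = np.argsort(scores)[::-1][:n_keep]

    # Step 2: cluster the pre-selected genes. Each gene is one point,
    # described by its standardized expression profile across samples.
    Z = X[:, pre]
    Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(Z.T)

    # Step 3: weight each gene by the magnitude of its linear SVM coefficient
    # (summed over classes, so the multi-class case also works).
    svm = LinearSVC(C=1.0, max_iter=10000, random_state=seed).fit(Z, y)
    weights = np.abs(svm.coef_).sum(axis=0)

    # Step 4: in each cluster, keep the gene with the largest weight.
    reps = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size == 0:  # defensive: skip a cluster with no genes
            continue
        reps.append(int(pre[members[np.argmax(weights[members])]]))
    return np.array(sorted(reps))
```

Calling `select_genes(X, y)` on an expression matrix with samples in rows returns the column indices of the selected genes. Replacing the `np.argmax` step with a uniformly random choice among `members` would give a baseline in the spirit of the Random strategy mentioned in the abstract.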