现代电子技术
MODERN ELECTRONICS TECHNIQUE
2015
Issue 4
pp. 19-24 (6 pages)
protein-ATP binding site%position-specific scoring matrix%sliding window%support vector regression (SVR) model%random under-sampling
Classifying the residues of a protein sequence into ATP-binding and non-binding sites is an imbalanced binary classification problem: binding residues (positive samples) are far fewer than non-binding residues (negative samples). Motivated by the machine-learning view that classification can be treated as a special case of regression, and by the characteristics of the problem itself, this paper proposes a protein-ATP binding site prediction method based on random under-sampling and support vector regression (SVR). First, a sliding window is used to extract features for every residue in the protein sequences, which yields an imbalanced two-class sample set. Second, random under-sampling is applied to remove the significant imbalance between positive and negative samples. Finally, an SVR prediction model is built and a suitable threshold is selected to distinguish binding residues from non-binding ones. Experimental results on a standard data set, together with comparisons against several recently reported prediction methods, validate the effectiveness of the proposed method.
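The three steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the zero-padding at the sequence ends, the 1:1 positive-to-negative ratio after under-sampling, and the label convention (1 for binding, anything else for non-binding) are all hypothetical choices, since the abstract does not specify them. The regression model itself is left out; any regressor that returns a real-valued score per residue (an SVR, for instance) could supply the `scores` passed to the last function.

```python
import random


def sliding_window_features(pssm, window_size):
    """Concatenate the PSSM rows in a window centred on each residue.

    `pssm` is a list of per-residue score rows. Window positions that fall
    outside the sequence are filled with zeros (an assumption of this sketch).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    if not pssm:
        return []
    half = window_size // 2
    zero_row = [0.0] * len(pssm[0])
    features = []
    for centre in range(len(pssm)):
        vector = []
        for pos in range(centre - half, centre + half + 1):
            row = pssm[pos] if 0 <= pos < len(pssm) else zero_row
            vector.extend(row)
        features.append(vector)
    return features


def random_under_sample(features, labels, seed=0):
    """Keep every positive sample and an equally sized random subset of negatives.

    The 1:1 target ratio is an assumption; the abstract only says that the
    imbalance is removed.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y != 1]
    kept_negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    kept = sorted(positives + kept_negatives)
    return [features[i] for i in kept], [labels[i] for i in kept]


def predict_binding(scores, threshold):
    """Map real-valued regression outputs to binding (1) or non-binding (0)."""
    return [1 if s >= threshold else 0 for s in scores]
```

In use, the window features would be computed for every residue of every training sequence, the pooled samples would be balanced with `random_under_sample`, a regressor would be fitted on the balanced set, and its outputs on new residues would be passed to `predict_binding` with a threshold chosen on validation data.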