物理化学学报
ACTA PHYSICO-CHIMICA SINICA
2013
Issue 3
pp. 498-507
10 pages in total
王志明%韩娜%袁哲明*%伍朝华
定量构效关系%岭回归%支持向量机%特征选择%高维特征
Quantitative structure-activity relationship%Support vector machine%Ridge regression%Feature selection%High-dimensional feature
岭回归估计权重绝对值在一定程度上体现了对应特征作用大小,据此发展了基于岭回归(RR)和支持向量机(SVM)的高维特征选择算法.对苦味二肽(BTT)和细胞毒性T淋巴细胞(CTL)表位9肽两个肽体系,以氨基酸的531个物理化学性质参数直接表征肽结构,各获得1062、4779个初始特征;对训练集,初始特征以岭回归排序后序贯引入,当SVM留一法交叉测试(LOOCV)的均方误差(MSE)显著上扬时终止,最后以多轮末尾淘汰进一步精筛,分别获得7、18个物理化学意义明确的保留特征.基于保留特征与支持向量回归(SVR),对训练集建立定量构效关系(QSAR)模型,预测独立测试集,其拟合精度、留一法交叉测试精度、独立预测精度均优于现有文献报道结果.新方法运行速度快,选取的特征物理化学意义明确,解释性强,在肽、蛋白质定量构效关系建模等高维数据回归预测领域有较广泛应用前景.
The absolute values of the weights estimated by ridge regression (RR) reflect, to some extent, the importance of the corresponding features. Based on RR and the support vector machine (SVM), a new feature selection algorithm for high-dimensional data is proposed. Two peptide systems are used as examples: the bitter tasting thresholds (BTT) of dipeptides and cytotoxic T lymphocyte (CTL) epitope nonapeptides. Each residue of a peptide was represented directly by all 531 physicochemical property parameters of its amino acid, giving 1062 and 4779 initial descriptors for BTT and CTL, respectively. Each dataset was divided into a training set and a test set, and weight estimates for all training-set descriptors were generated by RR. The descriptors were then introduced sequentially in descending order of absolute weight until the mean square error (MSE) of SVM leave-one-out cross validation (LOOCV) increased significantly. The descriptors retained at this step were refined further by multiple rounds of elimination, leaving 7 and 18 descriptors with clear physicochemical meaning for BTT and CTL, respectively. A quantitative structure-activity relationship (QSAR) model based on support vector regression (SVR) was built on the training set with the retained descriptors and then used to predict the independent test set. The fitting, LOOCV, and external prediction accuracies were all better than the values reported in the literature. Because it is fast and the selected features have clear physicochemical meaning and are easy to interpret, the new method is broadly applicable to regression prediction on high-dimensional data, such as QSAR modeling of peptides and proteins.
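The selection procedure summarized in the abstract (RR ranking, sequential introduction under SVM LOOCV, then elimination rounds) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the use of scikit-learn, the hyperparameters (`alpha`, `C`), the `tolerance` rule for a "significant" rise in MSE, the greedy elimination criterion, and the synthetic data are all assumptions, since the abstract does not specify them.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def loocv_mse(X, y):
    """Mean squared error of an RBF-kernel SVR under leave-one-out CV."""
    pred = cross_val_predict(SVR(kernel="rbf", C=10.0), X, y, cv=LeaveOneOut())
    return float(np.mean((y - pred) ** 2))


def rank_by_ridge(X, y, alpha=1.0):
    """Feature indices sorted by descending absolute ridge weight."""
    weights = Ridge(alpha=alpha).fit(X, y).coef_
    return np.argsort(-np.abs(weights))


def forward_select(X, y, order, tolerance=1.10, max_features=30):
    """Introduce ranked features one at a time; stop when LOOCV MSE rises.

    Assumed stopping rule: MSE exceeds `tolerance` times the best MSE so far.
    """
    best_mse, best_k = np.inf, 0
    for k in range(1, min(max_features, len(order)) + 1):
        mse = loocv_mse(X[:, order[:k]], y)
        if mse < best_mse:
            best_mse, best_k = mse, k
        elif mse > tolerance * best_mse:
            break
    return [int(i) for i in order[:best_k]]


def backward_eliminate(X, y, selected):
    """Assumed elimination rule: drop a feature while doing so lowers LOOCV MSE."""
    selected = list(selected)
    current = loocv_mse(X[:, selected], y)
    while len(selected) > 1:
        trials = [
            (loocv_mse(X[:, [f for f in selected if f != g]], y), g)
            for g in selected
        ]
        mse, candidate = min(trials)
        if mse >= current:
            break
        selected.remove(candidate)
        current = mse
    return selected


# Synthetic training set (hypothetical): only features 3 and 7 carry signal.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(50, 30)))
y = 3.0 * X[:, 3] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=50)

order = rank_by_ridge(X, y)
selected = backward_eliminate(X, y, forward_select(X, y, order, max_features=10))
print("retained feature indices:", sorted(selected))
```

Only training data enters the ranking and both selection stages, in line with the abstract; an independent test set would be touched once, by the final SVR model fitted on the retained columns.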