物理化学学报
物理化學學報
물이화학학보
ACTA PHYSICO-CHIMICA SINICA
2014
Issue 6
1091-1098
8 pages in total
李咏%周玮%代志军%陈渊%王志明%袁哲明
李詠%週瑋%代誌軍%陳淵%王誌明%袁哲明
리영%주위%대지군%진연%왕지명%원철명
蛋白质折叠%折叠速率预测%高维特征%特征筛选%支持向量回归
蛋白質摺疊%摺疊速率預測%高維特徵%特徵篩選%支持嚮量迴歸
단백질절첩%절첩속솔예측%고유특정%특정사선%지지향량회귀
Protein folding%Folding rate prediction%High-dimensional feature%Feature screening%Support vector regression
折叠速率预测对阐明蛋白质折叠机理意义重大.本文收集了115条目前已知折叠速率的蛋白质样本(包括二态、多态和混态蛋白),为了较全面地表征蛋白质分子的一级结构信息,提取序列长度、氨基酸残基多尺度组分、成对残基k-space特征与基于残基物理化学性质的地统计学关联总共9357维特征.经改进的二元矩阵重排过滤器和多轮末尾淘汰非线性筛选,获得23个物理化学意义明确的保留特征,建立的非线性支持向量回归模型Jackknife交叉验证的相关系数R=0.95,优于文献报道及其他参比特征选择方法.支持向量回归解释体系表明折叠速率与保留描述符的非线性回归极显著,分析了各保留描述符对折叠速率的影响,结果表明蛋白质折叠速率与序列长度、中短程关联特征、三联体残基组份特征等密切相关.
摺疊速率預測對闡明蛋白質摺疊機理意義重大.本文收集瞭115條目前已知摺疊速率的蛋白質樣本(包括二態、多態和混態蛋白),為瞭較全麵地錶徵蛋白質分子的一級結構信息,提取序列長度、氨基痠殘基多呎度組分、成對殘基k-space特徵與基于殘基物理化學性質的地統計學關聯總共9357維特徵.經改進的二元矩陣重排過濾器和多輪末尾淘汰非線性篩選,穫得23箇物理化學意義明確的保留特徵,建立的非線性支持嚮量迴歸模型Jackknife交扠驗證的相關繫數R=0.95,優于文獻報道及其他參比特徵選擇方法.支持嚮量迴歸解釋體繫錶明摺疊速率與保留描述符的非線性迴歸極顯著,分析瞭各保留描述符對摺疊速率的影響,結果錶明蛋白質摺疊速率與序列長度、中短程關聯特徵、三聯體殘基組份特徵等密切相關.
절첩속솔예측대천명단백질절첩궤리의의중대.본문수집료115조목전이지절첩속솔적단백질양본(포괄이태、다태화혼태단백),위료교전면지표정단백질분자적일급결구신식,제취서렬장도、안기산잔기다척도조분、성대잔기k-space특정여기우잔기물이화학성질적지통계학관련총공9357유특정.경개진적이원구진중배과려기화다륜말미도태비선성사선,획득23개물이화학의의명학적보류특정,건립적비선성지지향량회귀모형Jackknife교차험증적상관계수R=0.95,우우문헌보도급기타삼비특정선택방법.지지향량회귀해석체계표명절첩속솔여보류묘술부적비선성회귀겁현저,분석료각보류묘술부대절첩속솔적영향,결과표명단백질절첩속솔여서렬장도、중단정관련특정、삼련체잔기조빈특정등밀절상관.
Folding rate prediction plays an important role in clarifying the protein folding mechanism. In this work, we collected 115 protein samples with known folding rates, including two-state, multi-state, and mixed-state proteins. To characterize the primary structure of the protein molecules more comprehensively, we extracted sequence length, multi-scale amino acid residue composition, k-space features for residue pairs, and geostatistical association features based on the physicochemical properties of the residues, so that each protein sequence was represented by a numeric vector of 9357 features. Using an improved binary matrix shuffling filter followed by multi-round worst descriptor elimination, a nonlinear screening procedure, we retained 23 features with clear physicochemical meaning. A nonlinear support vector regression (SVR) model built on these 23 retained features gave a correlation coefficient of R = 0.95 in Jackknife cross validation, which was superior to results reported in the literature and to the other reference feature selection methods. Finally, an interpretability system for SVR showed that the nonlinear regression of folding rate on the retained descriptors was highly significant. Analysis of the effect of each retained descriptor indicated that the protein folding rate may be closely related to sequence length, medium- and short-range association features, triplet residue composition features, and other descriptors.
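The abstract names k-space features for residue pairs but does not define them. As a rough illustration only, the sketch below assumes the common "k-spaced pair" reading: the frequency of each ordered residue pair (a, b) with exactly k residues between them. The function names, the 20-letter alphabet, and the normalization by window count are assumptions made for this example, not the authors' implementation.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_spaced_pair_features(sequence, k):
    """Frequency of each ordered pair (a, b) with exactly k residues between a and b.

    Hypothetical helper: k = 0 means adjacent residues. Pairs containing
    non-standard letters are counted in the window total but not reported.
    """
    gap = k + 1
    n_windows = max(len(sequence) - gap, 0)
    counts = Counter(sequence[i] + sequence[i + gap] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for very short sequences
    pairs = ("".join(p) for p in product(AMINO_ACIDS, repeat=2))
    return {pair: counts.get(pair, 0) / total for pair in pairs}


def describe(sequence, max_k=2):
    """Concatenate sequence length with k-spaced pair frequencies for k = 0..max_k."""
    features = {"length": float(len(sequence))}
    for k in range(max_k + 1):
        for pair, freq in k_spaced_pair_features(sequence, k).items():
            features[f"k{k}_{pair}"] = freq
    return features
```

For the toy sequence `"ACAC"`, `k_spaced_pair_features("ACAC", 1)` assigns 0.5 to both `"AA"` and `"CC"`. A descriptor vector of this kind grows by 400 entries per value of k, which is one reason a feature screening step such as the one described in the abstract becomes necessary.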