计算机工程
COMPUTER ENGINEERING
2013
Issue 12
pp. 181-185, 190
6 pages in total
clustering analysis; protein sequence; generalized Substitution Matching Similarity (gSMS); Affinity Propagation (AP) clustering; Huffman decision; F-measure index
Existing Affinity Propagation (AP) clustering algorithms cannot adequately reflect the clustering structure of complex protein sequences. This paper therefore proposes an adaptive AP classification method based on generalized SMS and Huffman decision (adAP/GSHD). Protein sequences are clustered using the generalized Substitution Matching Similarity (gSMS) and the existing adaptive Affinity Propagation (adAP) algorithm. Huffman coding is then applied, and the average code length is constrained so that the clustering result reflects the family structure of the protein sequences. adAP/GSHD is tested against four other classic clustering methods on six datasets drawn from the Clusters of Orthologous Groups (COG) of proteins database and the Structural Classification of Proteins (SCOP) database. The results show that the method yields a number of clusters closer to the true number of families and a more compact clustering structure for a given set of proteins, and that its average F-measure is 19.67%, 8.7%, 9.5% and 43.81% higher than that of adAP, SMS, Spectral Clustering and TribeMCL, respectively.
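The abstract names two quantities without defining them: the average Huffman code length of a clustering result and the F-measure used for evaluation. The following is a minimal sketch of how each might be computed, not the authors' implementation. It assumes that every cluster is treated as one Huffman symbol weighted by its size, and that the F-measure is the common class-weighted variant (the best F score per true family, weighted by family size). The function names `average_huffman_code_length` and `clustering_f_measure` are hypothetical, and the abstract does not say how adAP/GSHD applies the code-length limit.

```python
import heapq
from collections import Counter


def average_huffman_code_length(cluster_sizes):
    """Average Huffman code length (bits per sequence), one symbol per cluster.

    Assumption: each cluster is a symbol whose weight is its size.
    """
    total = sum(cluster_sizes)
    if total <= 0:
        raise ValueError("cluster_sizes must contain at least one sequence")
    if len(cluster_sizes) == 1:
        return 0.0  # convention chosen here: a single symbol needs no bits
    heap = list(cluster_sizes)
    heapq.heapify(heap)
    weighted_length = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every sequence beneath them.
        weighted_length += a + b
        heapq.heappush(heap, a + b)
    return weighted_length / total


def clustering_f_measure(true_labels, predicted_labels):
    """Class-weighted F-measure: best F score per true family, weighted by size."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    family_sizes = Counter(true_labels)
    cluster_sizes = Counter(predicted_labels)
    overlap = Counter(zip(true_labels, predicted_labels))
    score = 0.0
    for family, family_size in family_sizes.items():
        best = 0.0
        for cluster, cluster_size in cluster_sizes.items():
            shared = overlap[(family, cluster)]
            if shared == 0:
                continue
            precision = shared / cluster_size
            recall = shared / family_size
            best = max(best, 2 * precision * recall / (precision + recall))
        score += family_size / n * best
    return score
```

Under these assumptions, a skewed result such as cluster sizes `[4, 2, 1, 1]` gives a shorter average code length than four equal clusters of the same total size. This is one way a bound on the average code length could be related to the shape of a clustering.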