计算机系统应用
APPLICATIONS OF THE COMPUTER SYSTEMS
2014
Issue 7
pp. 24-30
7 pages
DNA sequence%classification%feature representation%Hidden Markov Models (HMM)%eigenvalue decomposition
DNA sequence classification is a basic task in bioinformatics that aims to predict the category of a DNA sequence from its structural or functional similarity to other sequences. Effective classification requires mapping sequences into a feature vector space while preserving, as far as possible, the order relationships between nucleotides, which remains difficult. Existing methods tend to lose classification accuracy because they represent the nucleotides in DNA sequences incompletely. To address this, this paper proposes a new feature representation method for DNA sequences. The method first trains a Hidden Markov Model (HMM) on each sequence, then projects the DNA sequence onto the vector space spanned by the eigenvectors of the HMM state transition probability matrix. Based on this representation, a K-Nearest Neighbour (K-NN) classifier is constructed to classify DNA sequences in that vector space. Experimental results show that the new feature representation preserves the relationships between different nucleotides in a DNA sequence more completely and fully reflects the structural information in the sequence, which in turn improves classification accuracy.
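The pipeline described in the abstract (per-sequence transition model, eigenvector-based features, K-NN) can be illustrated with a heavily simplified sketch. The abstract does not give the HMM topology, the training procedure, or the exact projection, so the code below makes several assumptions that are not the authors' method: it replaces the HMM with a first-order Markov chain over the four bases, uses only the leading left eigenvector of the transition matrix (found by power iteration) as the feature vector, and compares features by Euclidean distance. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"
INDEX = {base: i for i, base in enumerate(BASES)}


def transition_matrix(seq, pseudocount=1.0):
    """Row-stochastic 4x4 matrix of base-to-base transition frequencies.

    Assumption: a first-order Markov chain stands in for the trained HMM.
    The pseudocount keeps every entry positive, so rows never sum to zero.
    """
    counts = [[pseudocount] * 4 for _ in range(4)]
    for a, b in zip(seq, seq[1:]):
        # Pairs containing a missing or ambiguous base (e.g. 'N') are skipped.
        if a in INDEX and b in INDEX:
            counts[INDEX[a]][INDEX[b]] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def leading_left_eigenvector(P, iterations=200):
    """Power iteration for the left eigenvector of P with eigenvalue 1."""
    v = [0.25] * 4
    for _ in range(iterations):
        v = [sum(v[i] * P[i][j] for i in range(4)) for j in range(4)]
        total = sum(v)
        v = [x / total for x in v]  # renormalise to guard against drift
    return v


def features(seq):
    """Feature vector of a sequence: leading eigenvector of its transition matrix."""
    return leading_left_eigenvector(transition_matrix(seq.upper()))


def knn_predict(train, query, k=3):
    """Majority vote among the k training sequences nearest to the query."""
    q = features(query)
    ranked = sorted(train, key=lambda item: math.dist(features(item[0]), q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labelled sequences, invented purely for illustration.
train = [
    ("ATATATTAATATAT", "AT-rich"),
    ("AATTATATTAAT", "AT-rich"),
    ("GCGCGGCCGCGC", "GC-rich"),
    ("CGCGCCGGCGCG", "GC-rich"),
]

print(knn_predict(train, "ATTATANATAT", k=3))
```

Using only the leading eigenvector keeps the sketch dependency-free, but it discards most of the transition structure. A fuller version would take all eigenvectors from a numerical linear algebra library and train a genuine HMM per sequence, as the abstract describes.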