计算机学报
計算機學報
계산궤학보
CHINESE JOURNAL OF COMPUTERS
2010年
4期
747-754
,共8页
重复体识别%适应性后缀树%Ukkonen算法%RepSeeker算法
重複體識彆%適應性後綴樹%Ukkonen算法%RepSeeker算法
중복체식별%괄응성후철수%Ukkonen산법%RepSeeker산법
repeats identification%adaptive suffix tree%Ukkonen algorithm%RepSeeker algorithm
现有的在DNA序列中识别重复体的算法多数是基于比对的,对识别速度和吞吐量有很大的限制.针对这个问题文中根据一个平衡重复体的长度和频率的定义,提出了一种基于Ukkonen后缀树的快速识别重复体的RepSeeker算法.算法采用最低限制频率,最大程度地扩展了重复体的长度,同时为了进一步地提高RepSeeker算法的效率,对Ukkonen的后缀树构造算法进行了适应性改进,在构造时加入RepSeeker算法所需的结点信息并将叶子结点和分支结点加以区分,从而使得RepSeeker算法能通过直接读取结点信息来求得子串频率和子串位置.这种改进较大地提高了RepSeeker算法的性能,而且空间开销不大.实验中使用了NCBI中的9条典型DNA序列作为测试数据,并对后缀树改进前后的重复体识别算法做了比较分析.结果表明,RepSeeker在没有损失精度的情况下缩短了算法的运行时间.实验结果与理论上的分析一致.
現有的在DNA序列中識彆重複體的算法多數是基于比對的,對識彆速度和吞吐量有很大的限製.針對這箇問題文中根據一箇平衡重複體的長度和頻率的定義,提齣瞭一種基于Ukkonen後綴樹的快速識彆重複體的RepSeeker算法.算法採用最低限製頻率,最大程度地擴展瞭重複體的長度,同時為瞭進一步地提高RepSeeker算法的效率,對Ukkonen的後綴樹構造算法進行瞭適應性改進,在構造時加入RepSeeker算法所需的結點信息併將葉子結點和分支結點加以區分,從而使得RepSeeker算法能通過直接讀取結點信息來求得子串頻率和子串位置.這種改進較大地提高瞭RepSeeker算法的性能,而且空間開銷不大.實驗中使用瞭NCBI中的9條典型DNA序列作為測試數據,併對後綴樹改進前後的重複體識彆算法做瞭比較分析.結果錶明,RepSeeker在沒有損失精度的情況下縮短瞭算法的運行時間.實驗結果與理論上的分析一緻.
현유적재DNA서렬중식별중복체적산법다수시기우비대적,대식별속도화탄토량유흔대적한제.침대저개문제문중근거일개평형중복체적장도화빈솔적정의,제출료일충기우Ukkonen후철수적쾌속식별중복체적RepSeeker산법.산법채용최저한제빈솔,최대정도지확전료중복체적장도,동시위료진일보지제고RepSeeker산법적효솔,대Ukkonen적후철수구조산법진행료괄응성개진,재구조시가입RepSeeker산법소수적결점신식병장협자결점화분지결점가이구분,종이사득RepSeeker산법능통과직접독취결점신식래구득자천빈솔화자천위치.저충개진교대지제고료RepSeeker산법적성능,이차공간개소불대.실험중사용료NCBI중적9조전형DNA서렬작위측시수거,병대후철수개진전후적중복체식별산법주료비교분석.결과표명,RepSeeker재몰유손실정도적정황하축단료산법적운행시간.실험결과여이론상적분석일치.
Many existing methods for repeats identification are based on alignments. Their speed and time significantly limit their applications. This paper presents the fast Rep(eats)Seeker algorithm for repeats identification based on the adaptive Ukkonen suffix tree construction algorithm. The RepSeeker algorithm uses the lowest frequency limit to maximize the extension of repeats. The adaptive improvements to the Ukkonen algorithm are made to increase the efficiency of the RepSeeker algorithm. The node information required by the RepSeeker algorithm is added during the suffix tree construction. Because information on leaves and branch nodes are different, the RepSeeker algorithm directly obtains the needed information from the nodes to find out the frequency and locate the positions of a substring. The improvement is considerable for the repeats identification at a little extra cost in space. Nine sequences from the National Center for Biotechnology Information(NCBI) are used to test the performance of the RepSeeker algorithm. Comparisons between before and after improvements of the suffix tree construction show that the running time of the RepSeeker algorithm is reduced without losing the accuracy. The experimental results agree with the theoretical expectations.