计算机研究与发展
JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT
2010
Issue 2
pp. 264-276
13 pages
Keywords: sequence data; similarity metric; distance distribution; filtering technique; similarity query
Sequence data is ubiquitous in domains such as text, Web access logs, and biological databases, and similarity queries over such data are an important means of extracting useful information. In recent years, with the growth of scientific computing and the generation of large volumes of sequence data, sequence similarity query has become an active research topic in data analysis. Several important issues are involved: the similarity metrics used in different application domains and the relationships among them; statistical information about the distance distribution in random sequence collections and its role in analyzing the performance of query algorithms; and the key techniques for efficiently answering similarity queries over large-scale datasets, together with their respective strengths and weaknesses. This survey summarizes the classification and characteristics of sequence data, presents several similarity metrics for sequence data and statistical information about the distance distribution between random sequences, and further analyzes the relationships among these metrics. It then introduces several types of sequence similarity query and the core problems that such queries must solve. On this basis, the key techniques for sequence similarity query are classified and evaluated. Finally, the challenges facing research on similarity queries over sequence data are discussed, and future research directions are summarized.