计算机科学与探索
Journal of Frontiers of Computer Science & Technology
2015
Issue 10
pp. 1180-1194
15 pages
王慧锋%段磊%胡斌%邓松%王文韬%秦攀
data quality%probabilistic suffix tree%gap constraint
Sequential data is ubiquitous in real-world applications, so algorithms for mining it remain an active research topic. The reliability of mining results depends on the quality of the sequences. Traditional data quality evaluation methods analyze quality problems through statistical indicators, but such indicators cannot assess the relationships among the elements of unstructured sequential data. To detect the quality of a sequence, this paper proposes a quality evaluation algorithm for sequential data based on the probabilistic suffix tree. Specifically, under a specified gap constraint, a probabilistic suffix tree is built from sample sequences of reliable quality, and the tree is then used to evaluate the quality of a query sequence. Finally, experiments on real-world sequence datasets confirm the effectiveness, efficiency, and scalability of the proposed algorithm.
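The abstract describes the method only at a high level, so the following Python sketch is an illustration of the general idea and not the paper's algorithm. It trains a simplified variable-order model on sequences assumed to be reliable and then scores a query sequence by its mean log-probability. Everything specific here is an assumption: the function names (`gapped_contexts`, `build_model`, `quality_score`), the parameters (`max_depth`, `min_gap`, `max_gap`, `alpha`), the flat dictionary used in place of an explicit tree, the reading of the gap constraint as a fixed spacing between context symbols, and the additive smoothing.

```python
import math
from collections import defaultdict


def gapped_contexts(seq, i, max_depth, min_gap, max_gap):
    """Yield the contexts that precede position i, longest first for each gap.

    Assumption: a gap g means consecutive context symbols lie g + 1
    positions apart. Each context is tagged with its gap value.
    """
    for gap in range(min_gap, max_gap + 1):
        step = gap + 1
        for depth in range(max_depth, 0, -1):
            start = i - depth * step
            if start >= 0:
                yield (gap,) + tuple(seq[start:i:step])


def build_model(sequences, max_depth=2, min_gap=0, max_gap=1):
    """Count next-symbol frequencies per context from reliable sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    alphabet = set()
    for seq in sequences:
        for i, symbol in enumerate(seq):
            alphabet.add(symbol)
            counts[()][symbol] += 1  # root node: empty context
            for context in gapped_contexts(seq, i, max_depth, min_gap, max_gap):
                counts[context][symbol] += 1
    return {
        "counts": {ctx: dict(nxt) for ctx, nxt in counts.items()},
        "alphabet_size": max(len(alphabet), 1),
        "params": (max_depth, min_gap, max_gap),
    }


def quality_score(model, query, alpha=1.0):
    """Mean log-probability of the query symbols (higher = closer to the sample)."""
    counts = model["counts"]
    total = 0.0
    for i, symbol in enumerate(query):
        # Use the longest context that was observed in training, else the root.
        seen = [c for c in gapped_contexts(query, i, *model["params"]) if c in counts]
        context = max(seen, key=len, default=())
        node = counts.get(context, {})
        # Additive (Laplace-style) smoothing so unseen symbols keep nonzero mass.
        prob = (node.get(symbol, 0) + alpha) / (
            sum(node.values()) + alpha * model["alphabet_size"]
        )
        total += math.log(prob)
    return total / max(len(query), 1)


model = build_model(["ababab", "abab", "bababa"])
regular = quality_score(model, "abab")
irregular = quality_score(model, "aabb")
```

In this toy run the training sequences strictly alternate, so `regular` comes out higher than `irregular`. A real evaluation would also need a threshold or ranking rule to turn scores into a quality decision, which the abstract does not specify.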