计算机研究与发展
JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT
2015, No. 7, pp. 1546-1557 (12 pages)
阎芳 (Yan Fang), 李元章 (Li Yuanzhang), 张全新 (Zhang Quanxin), 谭毓安 (Tan Yu'an)
content defined chunking (CDC); object; unstructured data; OpenXML standard; compound file; data de-duplication
Content defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in storage systems. Most existing de-duplication techniques are based on CDC and do not consider the distinct content characteristics of different file types: they determine chunk boundaries in an essentially random way and apply a single strategy to all file types. Such methods have been shown to work well for text and simple content, but they do not achieve optimal performance on compound files, which are composed of unstructured data, usually occupy large storage space, and often contain multimedia data. Object-based data de-duplication is currently the most advanced approach and an effective way to detect duplicate data in such files. This paper analyzes the properties of compound files that follow the OpenXML standard, presents a basic method for object extraction, and proposes an algorithm that determines the de-duplication granularity from the distribution and structure of objects. The goal is to effectively detect identical objects within a single file and across different files, and to de-duplicate compound files effectively even when their physical layout changes. Simulation experiments on typical collections of unstructured data show that, in the general case, object-based de-duplication improves the de-duplication ratio of unstructured data by about 10% compared with the CDC method.
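To make the object-based idea concrete, below is a minimal, illustrative sketch; it is not the authors' implementation. It relies only on the fact that OpenXML documents (.docx, .xlsx, .pptx) are ZIP packages whose internal parts ("objects" such as XML parts and embedded media) can be read individually. Whole-part SHA-256 hashing stands in for the paper's granularity-determination algorithm, and the file names in the usage section are hypothetical.

# Illustrative sketch: enumerate the parts of OpenXML packages and fingerprint
# each part, so identical objects can be detected across files regardless of
# how the container's physical layout changes (which byte-oriented CDC may miss).
import hashlib
import zipfile
from collections import defaultdict
from typing import Dict, List, Tuple

def extract_object_fingerprints(path: str) -> List[Tuple[str, str, int]]:
    """Return (part name, SHA-256 digest, size) for every part in an OpenXML package."""
    fingerprints = []
    with zipfile.ZipFile(path) as package:
        for info in package.infolist():
            if info.is_dir():          # skip directory entries, keep only real parts
                continue
            data = package.read(info.filename)
            fingerprints.append((info.filename, hashlib.sha256(data).hexdigest(), len(data)))
    return fingerprints

def find_duplicate_objects(paths: List[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group identical objects (by content hash) across a collection of files."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for path in paths:
        for name, digest, _size in extract_object_fingerprints(path):
            index[digest].append((path, name))
    # Keep only digests seen more than once, i.e. de-duplication candidates.
    return {digest: locations for digest, locations in index.items() if len(locations) > 1}

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    duplicates = find_duplicate_objects(["report_v1.docx", "report_v2.docx"])
    for digest, locations in duplicates.items():
        print(digest[:12], locations)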