Journal of Computer Science and Technology (English Edition)
2002
Issue 6
pp. 807-819
13 pages
Zhou Aoying%Qian Weining%Qian Hailei%Zhang Long%Liang Yuqi%Jin Wen
clustering%XML (eXtensible Markup Language)%DTD (Document Type Definition)
XML (eXtensible Markup Language) is a standard that is widely applied in data representation and data exchange. However, DTD (Document Type Definition), an important concept of XML, is not taken full advantage of in current applications. In this paper, a new method for clustering DTDs is presented, which can also be used in XML document clustering. The two-level method clusters the elements in DTDs and clusters the DTDs themselves separately. Element clustering forms the first level and provides element clusters, which are generalizations of relevant elements. DTD clustering utilizes this generalized information and forms the second level of the whole clustering process. The two-level method has the following advantages: 1) it takes into consideration both the content and the structure within DTDs; 2) the generalized information about elements is more useful than the separate words in the vector model; 3) it facilitates the search for outliers. The experiments show that this method is able to categorize relevant DTDs effectively.
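The two-level idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify similarity measures or clustering procedures, so everything here is an assumption chosen for illustration. Elements are compared by name only (using `difflib.SequenceMatcher`), DTDs are compared by Jaccard similarity over their element clusters, both levels use simple single-link threshold grouping, and the function names, thresholds, and toy DTDs are hypothetical. The structural information that the paper also considers is ignored.

```python
from difflib import SequenceMatcher
from itertools import combinations


def threshold_clusters(items, similarity, threshold):
    """Single-link grouping: any pair with similarity >= threshold is merged.

    Returns a dict mapping each item to a representative of its cluster.
    """
    parent = {item: item for item in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)
    return {item: find(item) for item in items}


def name_similarity(a, b):
    # Assumption: element names alone stand in for element "content".
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(s, t):
    union = s | t
    return len(s & t) / len(union) if union else 1.0


def two_level_cluster(dtds, element_threshold=0.8, dtd_threshold=0.5):
    # Level 1: cluster element names across all DTDs.
    elements = sorted({e for names in dtds.values() for e in names})
    element_cluster = threshold_clusters(elements, name_similarity, element_threshold)

    # Level 2: describe each DTD by its set of element clusters, then cluster DTDs.
    generalized = {
        name: frozenset(element_cluster[e] for e in names)
        for name, names in dtds.items()
    }
    return threshold_clusters(
        sorted(dtds),
        lambda a, b: jaccard(generalized[a], generalized[b]),
        dtd_threshold,
    )


# Hypothetical toy input: each DTD is reduced to its list of element names.
dtds = {
    "book.dtd": ["book", "title", "author", "price"],
    "books.dtd": ["books", "titles", "authors", "price"],
    "order.dtd": ["order", "customer", "item", "quantity"],
}
labels = two_level_cluster(dtds)
```

In this toy input, `book.dtd` and `books.dtd` share only the literal name `price`, so a plain word-level Jaccard comparison gives 1/7. After the first level merges `book`/`books`, `title`/`titles`, and `author`/`authors`, the two DTDs map to the same set of element clusters and fall into one group, while `order.dtd` stays on its own. This mirrors the abstract's second claimed advantage, under the simplifying assumptions stated above.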