软件学报
JOURNAL OF SOFTWARE
2003
Issue 5
pp. 976-983
8 pages
王晓玲%文继荣%栾金锋%马维英%董逸生
document database%information retrieval%passage retrieval%structured document
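Documents have a logical structure: titles, chapters, sections, and paragraphs form a document's inherent logic. Different users have different needs when searching documents, and how a retrieval system can return meaningful information has long been a central research task. Combining document structure and content, this paper proposes a new method for computing similarity in the retrieval of structured documents. The method supports retrieval of document content at multiple granularities, from words and phrases to paragraphs or sections. A question answering system was implemented based on this method, with Microsoft's Encarta encyclopedia as the test set; experimental comparison with traditional methods shows that the passages retrieved by this method are more reasonable and more effective.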
Structured documents are made up of a few logical components, such as titles, sections, subsections, and paragraphs. The components of each structured document can be represented by an ordered tree model, which can also be viewed as a hierarchical concept relationship. To meet users' requirements for more precise and concentrated search results, retrieval techniques should allow the user to retrieve document components of varying granularity. This paper presents a method for querying a document database by content and structure. The key idea is to construct a more comprehensive similarity function by taking advantage of the inherent hierarchical structure of documents. This work combines information retrieval techniques, semi-structured data querying, and proximate search for documents. The proposed method is evaluated on the Encarta encyclopedia document set, and the experimental results show that it can provide more accurate and focused answers than traditional document retrieval methods.
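
The abstract does not give the similarity function itself, so the following is only an illustrative sketch of the general idea of scoring document components of varying granularity over an ordered tree. Everything in it is an assumption made for illustration and not the paper's method: the `Node` class, the term-overlap `content_score`, the `decay` parameter, and the rule that a parent inherits a discounted share of its best child's score.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One logical component (document, section, paragraph) in an ordered tree."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def content_score(query: str, text: str) -> float:
    """Hypothetical content measure: fraction of distinct query terms found in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rank_components(root: Node, query: str, decay: float = 0.5) -> List[Tuple[float, str]]:
    """Score every component and return (score, label) pairs, best first.

    Assumed rule: a node's score is the larger of its own content score and
    its best child's score multiplied by `decay`, so a focused paragraph
    outranks the larger components that merely contain it.
    """
    results: List[Tuple[float, str]] = []

    def visit(node: Node) -> float:
        own = content_score(query, node.text)
        best_child = max((visit(c) for c in node.children), default=0.0)
        score = max(own, decay * best_child)
        results.append((score, node.label))
        return score

    visit(root)
    return sorted(results, key=lambda r: r[0], reverse=True)


# Small example tree: one document, two sections, two paragraphs.
doc = Node("doc", "", [
    Node("sec1", "history", [
        Node("p1", "passage retrieval in structured documents"),
        Node("p2", "unrelated text"),
    ]),
    Node("sec2", "other topics"),
])
ranked = rank_components(doc, "passage retrieval")
print(ranked[:3])
```

In this toy tree the paragraph that contains both query terms ranks first, followed by its enclosing section and then the whole document, which mirrors the abstract's goal of returning focused components instead of entire documents.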