Journal of Southeast University (English Edition)
2006, Issue 3, pp. 370-374 (5 pages)
Luo Na%Zuo Wanli%Yuan Fuyu%Zhang Jingbo%Zhang Huijie
ontology%text clustering%lexicon%WordNet
To improve clustering results and to allow selection among them, ontology semantics are combined with document clustering, and a new WordNet-based document clustering algorithm is proposed for the document-processing phase. First, the documents are represented by tf-idf, and each term vector is extended with new entries so that WordNet concepts appear in the document collection. Second, a feature extraction algorithm is applied to the documents. Finally, an ontology aggregation clustering (OAC) algorithm is proposed to improve document clustering. Experiments are carried out on the Reuters 20 News Group data set, with mutual information (MI) used as the measure for comparing results. The results show that the proposed ontology-based algorithm clearly outperforms existing clustering algorithms such as MNB, CLUTO, and co-clustering on text clustering.
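The representation and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' OAC algorithm, which the abstract does not specify. The `TOY_CONCEPTS` map is a hypothetical stand-in for WordNet lookups, the weighting `tf * log(N/df)` is one common tf-idf variant chosen here as an assumption, and adding a concept entry that accumulates the weights of its terms is likewise an assumed way to extend a term vector.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet lookups: term -> more general concept.
TOY_CONCEPTS = {
    "car": "vehicle",
    "truck": "vehicle",
    "apple": "fruit",
    "pear": "fruit",
}

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (tf * log(N/df))."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

def extend_with_concepts(vector, concepts=TOY_CONCEPTS):
    """Add a 'concept:<name>' entry per mapped term, accumulating term weights."""
    extended = dict(vector)
    for term, weight in vector.items():
        concept = concepts.get(term)
        if concept is not None:
            key = "concept:" + concept
            extended[key] = extended.get(key, 0.0) + weight
    return extended

def mutual_information(labels, clusters):
    """Mutual information (in nats) between true labels and cluster assignments."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    n_label = Counter(labels)
    n_cluster = Counter(clusters)
    mi = 0.0
    for (label, cluster), count in joint.items():
        p_joint = count / n
        # p_joint / (p_label * p_cluster), written with raw counts
        mi += p_joint * math.log(count * n / (n_label[label] * n_cluster[cluster]))
    return mi

# Toy usage: "car" and "truck" documents gain a shared "concept:vehicle" feature.
docs = [["car", "road"], ["truck", "road"], ["apple", "pie"]]
extended = [extend_with_concepts(v) for v in tfidf_vectors(docs)]
```

In this toy setup the first two documents share no content word other than "road", yet both receive a `concept:vehicle` entry after extension, which is the kind of semantic overlap the concept extension is meant to expose to the clustering step.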