图书与情报
LIBRARY AND INFORMATION
2015
Issue 3
pp. 118-124, 144
8 pages in total
Internet information%classification system%Chinese Library Classification%corpus
Classification systems are an effective form of information organization. Traditional document classification systems adapt poorly to the changed objects of classification and lack practicality, while existing web classification systems lack scientific rigor. An Internet information classification system that combines practicality with scientific rigor can effectively meet users' information needs and provides a foundation for research on automatic text classification. Taking the Chinese Library Classification and the Sina portal as examples, this paper examines the strengths and weaknesses of traditional document classification and of web information classification, proposes three design principles for an Internet information classification system (practicality, scientific rigor, and balance), and builds a classification system on these principles. To validate the system, a web crawler collected corpora from the NetEase portal (www.163.com) and Tencent (www.qq.com) as experimental data, and a comparative experiment was run against the classification system of the Fudan Corpus. The results show that, compared with the Fudan Corpus classification system, the proposed system is more practical, covers Internet information more comprehensively, has less overlap between categories and similar amounts of information in each category, and yields better text classification results.