计算机工程与应用
COMPUTER ENGINEERING AND APPLICATIONS
2014
Issue 10
pp. 141-146 (6 pages)
focused crawler%Context Graph model%search strategy%feature selection%TF-IDF
To address the low efficiency of traditional focused crawlers, this paper analyzes the heuristic web crawler search algorithm Context Graph and proposes an improved Context Graph search strategy. The strategy improves the original algorithm with a feature selection method based on word frequency differences and a modified TF-IDF formula. It takes into account how text from different parts of a web page affects feature selection, as well as the between-class and within-class weights of feature terms, in order to improve the quality of feature selection and evaluation. Experimental results show that, compared with the established traditional methods used as baselines, the improved strategy crawls topic-relevant pages more efficiently.
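The abstract does not give the modified TF-IDF formula, so the following is only a minimal sketch of the general idea it builds on: standard TF-IDF in which term occurrences are weighted by the part of the page they appear in. The section names, the weights in `SECTION_WEIGHTS`, and all function names are assumptions made for illustration, not values or definitions from the paper, and the between-class and within-class weighting is not modeled here.

```python
import math
from collections import Counter

# Hypothetical section weights (assumed for illustration, not from the paper).
SECTION_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def weighted_tf(page):
    """Normalized term frequency; each occurrence counts by its section weight."""
    counts = Counter()
    for section, text in page.items():
        weight = SECTION_WEIGHTS.get(section, 1.0)
        for term in text.lower().split():
            counts[term] += weight
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def idf(pages):
    """Standard inverse document frequency: log(N / df)."""
    n = len(pages)
    df = Counter()
    for page in pages:
        terms = set()
        for text in page.values():
            terms.update(text.lower().split())
        df.update(terms)  # each page contributes at most 1 per term
    return {t: math.log(n / d) for t, d in df.items()}


def tf_idf(pages):
    """Return one {term: score} dict per page."""
    idf_scores = idf(pages)
    return [
        {t: tf * idf_scores[t] for t, tf in weighted_tf(p).items()}
        for p in pages
    ]


pages = [
    {"title": "focused crawler", "body": "crawler search strategy"},
    {"title": "cooking recipes", "body": "search recipes online"},
]
scores = tf_idf(pages)
```

In this toy input, a term that appears on every page ("search") scores zero, while a term in the title outranks one that appears only in the body, which is the kind of position sensitivity the abstract describes.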