计算机技术与发展
COMPUTER TECHNOLOGY AND DEVELOPMENT
2015
Issue 8
pp. 90-93
4 pages in total
text similarity%semantics%Map/Reduce framework%TFIDF algorithm%TFIDFWGE algorithm
Among existing methods for computing text similarity, the TFIDF algorithm used to obtain keyword weights does not fully account for how a keyword's position within a text and its dispersion across the text library affect its weight, and it runs inefficiently when the text library holds a very large amount of information. To address these problems, this paper proposes TFIDFWGE, a TFIDF algorithm based on semantic information entropy and information gain. The algorithm adds a position weight to each given keyword and computes its entropy and information gain to obtain the keyword's final weight. The Map/Reduce framework of the Hadoop platform is used to implement both the TFIDFWGE algorithm and the vector space model (VSM) text similarity computation. Experiments on two real datasets show that TFIDFWGE achieves higher recall and precision than the existing TFIDF algorithm, and that the text similarity detection system implemented on Hadoop processes large text libraries more efficiently.
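For orientation, the baseline that the abstract compares against can be sketched as follows: standard TFIDF keyword weights combined with cosine similarity in a vector space model. This is a minimal single-machine illustration, not the authors' method. The abstract does not give the formulas for the position weight, entropy, or information gain terms of TFIDFWGE, so they are left out, and the function names `tf_idf_vectors` and `cosine_similarity` and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Baseline TFIDF only: weight = (term count / document length)
    * log(number of documents / document frequency).
    The TFIDFWGE position, entropy, and gain terms are NOT modeled,
    because the abstract does not specify them.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already-tokenized sample documents.
docs = [
    ["hadoop", "text", "similarity"],
    ["hadoop", "map", "reduce"],
    ["text", "similarity", "entropy"],
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[2]))
print(cosine_similarity(vectors[0], vectors[1]))
```

In a Map/Reduce setting, the document-frequency count and the per-document weighting would typically be split into separate map and reduce phases; how the paper partitions that work is not stated in the abstract.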