计算机工程
COMPUTER ENGINEERING
2014
Issue 12
pp. 199-204 (6 pages)
Keywords: Chinese encoding; Web filtering; high-frequency characters; pattern matching; finite state automata
Encoding identification is a prerequisite for Web page content filtering, and the coexistence of several Chinese encodings complicates the filtering of Chinese Web pages. To address this problem, this paper presents FKI (Frequency Keyword Identification), a Chinese Web page encoding identification algorithm based on the frequency distribution of Chinese characters. FKI selects the most frequently used characters to build high-frequency character encoding tables. Using these high-frequency character encodings as keywords, FKI scans the page to be identified with an improved pattern matching algorithm, counts the matches, and determines the page's actual encoding from the match results. Experimental results show that, compared with the Unigram algorithm, FKI achieves a higher recognition rate on the Chinese encodings in common use today. FKI is suitable for identifying the encoding of Chinese Web pages of unknown encoding quickly and accurately.
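The matching-and-counting idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the character list `HIGH_FREQ_CHARS`, the candidate set `CANDIDATE_ENCODINGS`, and the helper names are hypothetical, and plain `bytes.count` stands in for the improved pattern matching algorithm, whose details the abstract does not give.

```python
# Illustrative sketch only; the character table, candidate encodings, and
# scoring rule are assumptions, not the paper's actual tables or matcher.

# Hypothetical sample of very common characters that are written the same way
# in simplified and traditional Chinese, so every candidate can encode them.
HIGH_FREQ_CHARS = "的一是不了人我在有他中大上"

CANDIDATE_ENCODINGS = ("utf-8", "gbk", "big5")


def build_tables(chars=HIGH_FREQ_CHARS, encodings=CANDIDATE_ENCODINGS):
    """Return {encoding: set of byte patterns} for the high-frequency characters."""
    return {enc: {ch.encode(enc) for ch in chars} for enc in encodings}


def match_counts(data, tables):
    """Count how often each encoding's byte patterns occur in the raw bytes."""
    # bytes.count is a simple stand-in for a multi-pattern matcher.
    return {
        enc: sum(data.count(pattern) for pattern in patterns)
        for enc, patterns in tables.items()
    }


def detect_encoding(data, tables=None):
    """Pick the candidate with the most matches, or None if nothing matched."""
    tables = tables or build_tables()
    counts = match_counts(data, tables)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Calling `detect_encoding(page_bytes)` on the raw bytes of a page returns the name of the candidate encoding whose high-frequency patterns matched most often. Scanning once per pattern is wasteful for large tables; a single-pass multi-pattern automaton would be the natural replacement, and byte-level counting of this kind does not guard against matches that straddle character boundaries.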