计算机工程与应用
COMPUTER ENGINEERING AND APPLICATIONS
2014
Issue 4
pp. 132-139
8 pages
中文句子压缩%热词%语言学%句法分析树
Chinese sentence compression%hot words%linguistics%syntactic parse tree
传统的句子压缩方法多基于难以获得的“原句-压缩句”对齐语料库,因此提出了不依赖于对齐语料库的中文句子压缩算法。通过研究人工压缩结果并结合语言学知识,提出了词语层面和分句层面的两组压缩规则。算法在原句句法分析树和词语间依赖关系的基础上,使用两组规则进行压缩,同时为了保证压缩算法具有更强的适应性和准确性,引入词语的热度加强了压缩算法,最后通过句子整理和语法修复得到最终的压缩句。对比了人工压缩、只使用规则压缩和引入词语热度压缩三种压缩方法。实验结果表明,基于热度的启发式中文句子压缩算法可以在压缩比、语法性、信息量都损失较少的情况下,提高压缩句的热度。
Traditional sentence compression methods mostly rely on parallel corpora of original and compressed sentences, which are difficult to obtain. This paper therefore proposes a Chinese sentence compression algorithm that does not depend on such a corpus. Based on a study of human-produced compressions combined with linguistic knowledge, two sets of compression rules are proposed, one at the word level and one at the clause level. The algorithm applies both rule sets to the parse tree of the original sentence and to the dependency relations between its words. To make the algorithm more adaptable and accurate, word heat (a measure of how "hot" a word is) is introduced to enhance the compression. The final compressed sentence is obtained after sentence cleanup and grammar repair. Three methods are compared: human compression, rule-only compression, and compression enhanced with word heat. The results are evaluated on compression rate, grammaticality, informativeness, and heat. The experiments show that the heat-based heuristic Chinese sentence compression algorithm increases the heat of the compressed sentences with little loss in compression rate, grammaticality, or informativeness.
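The abstract does not give the rules or the heat formula, so the following is only a minimal sketch of the general idea: a word-level rule drops modifier words, and a heat score can override the rule to keep a hot word. Everything in it is an assumption made for illustration: the `Token` type, the dependency labels in `REMOVABLE_DEPS`, the heat values, the threshold, and the token-count definition of compression rate. It is not the authors' implementation and omits the clause-level rules, sentence cleanup, and grammar repair.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str     # dependency relation to the head word (illustrative labels)
    heat: float  # hypothetical heat score in [0, 1]


# Hypothetical word-level rule: words attached by these relations are removable.
REMOVABLE_DEPS = {"advmod", "tmod", "nn"}


def compress(tokens, heat_threshold):
    """Drop removable modifiers unless their heat reaches the threshold."""
    kept = []
    for tok in tokens:
        if tok.dep in REMOVABLE_DEPS and tok.heat < heat_threshold:
            continue  # the rule fires and the word is not hot enough to keep
        kept.append(tok)
    return kept


def compression_rate(original, compressed):
    """Compressed length divided by original length, counted in tokens."""
    if not original:
        raise ValueError("original sentence is empty")
    return len(compressed) / len(original)


def mean_heat(tokens):
    """Average heat of the tokens, or 0.0 for an empty sentence."""
    return sum(t.heat for t in tokens) / len(tokens) if tokens else 0.0


# Invented example sentence with hand-assigned labels and heat values.
tokens = [
    Token("政府", "nsubj", 0.2),
    Token("昨天", "tmod", 0.1),
    Token("正式", "advmod", 0.1),
    Token("发布", "root", 0.3),
    Token("雾霾", "nn", 0.9),  # a hot modifier
    Token("报告", "dobj", 0.4),
]

# An infinite threshold means heat never overrides the rule (rule-only).
rule_only = compress(tokens, heat_threshold=float("inf"))
heat_aware = compress(tokens, heat_threshold=0.5)

print("rule-only: ", "".join(t.text for t in rule_only))
print("heat-aware:", "".join(t.text for t in heat_aware))
print("rates:", compression_rate(tokens, rule_only), compression_rate(tokens, heat_aware))
print("heat: ", mean_heat(rule_only), mean_heat(heat_aware))
```

In this toy case the heat-aware variant keeps one extra word, so its output is slightly longer but has a higher average heat, which is the kind of trade-off the abstract reports.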