山东大学学报(理学版)
Journal of Shandong University (Natural Science)
2015
Issue 9
pp. 21-28
8 pages in total
唐亮%李倩%许洪波%易绵竹
平行语料库%多词短语%词对齐
parallel corpus%multi-word phrase%word alignment
在跨语言文本分析任务中,多词短语比单个词汇歧义小,语义表达更加准确,有助于提高文本理解的准确性。现有方法主要关注单个词的跨语言对齐。将多词短语抽取和跨语言对齐相融合,提出了一种基于多策略过滤的汉日多词短语抽取和对齐的方法。首先从一个语种出发,通过重复串、左右邻接熵、内部关联度、多词嵌套、停用词等方法提取并过滤得到具备完整语义的多词短语,然后利用平行语料库计算汉日多词短语的相似度,实现跨语言对齐。在整个过程中可结合日语语言规则与特点,根据语料规模、相关领域对过滤阈值进行动态调整,提高了多词短语的领域适用性。实验结果表明,该方法可有效抽取汉日多词短语并进行准确对齐,以多词短语为对齐单元,语义表达更完整,实用价值更大。
In cross-language text analysis, a multi-word phrase is less ambiguous and expresses meaning more precisely than a single word, which helps improve the accuracy of text understanding. Existing methods mainly focus on cross-language alignment of single words. This paper integrates multi-word phrase extraction with cross-language alignment and proposes a method for extracting and aligning Chinese-Japanese multi-word phrases based on multi-strategy filtering. First, starting from one language, multi-word phrases with complete semantics are extracted and filtered using repeated strings, left-right adjacent entropy, internal association strength, multi-word nesting, and stop words. Then, a parallel corpus is used to compute the similarity of Chinese and Japanese multi-word phrases and thereby align them across languages. Throughout the process, the rules and characteristics of the Japanese language can be taken into account, and the filtering thresholds are adjusted dynamically according to corpus size and domain, which improves the domain applicability of the extracted phrases. Experimental results show that the method effectively extracts Chinese-Japanese multi-word phrases and aligns them accurately; using multi-word phrases as the alignment unit gives more complete semantic expression and greater practical value.
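As a rough illustration of two steps named in the abstract, the sketch below computes the left or right adjacent entropy of a candidate phrase and a co-occurrence similarity between a Chinese and a Japanese phrase over a parallel corpus. The abstract does not give the exact formulas, so both are assumptions: the entropy is the usual Shannon entropy of neighboring tokens, and the similarity is a Dice coefficient over sentence pairs. The function names (`find_positions`, `adjacent_entropy`, `dice_similarity`) and the pre-tokenized input format are hypothetical.

```python
import math
from collections import Counter


def find_positions(tokens, phrase):
    """Return start indices where `phrase` occurs contiguously in `tokens`."""
    phrase = list(phrase)
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def adjacent_entropy(tokens, phrase, side):
    """Shannon entropy (bits) of tokens directly left or right of `phrase`.

    Assumption: a high value on both sides suggests the phrase is used in
    varied contexts and is therefore a plausible complete unit.
    """
    neighbors = Counter()
    for i in find_positions(tokens, phrase):
        j = i - 1 if side == "left" else i + len(phrase)
        if 0 <= j < len(tokens):
            neighbors[tokens[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())


def dice_similarity(pairs, zh_phrase, ja_phrase):
    """Dice coefficient over aligned sentence pairs containing each phrase.

    `pairs` is a list of (zh_tokens, ja_tokens). Dice is an assumed stand-in
    for the similarity measure, which the abstract does not specify.
    """
    zh_hits = {i for i, (zh, _) in enumerate(pairs) if find_positions(zh, zh_phrase)}
    ja_hits = {i for i, (_, ja) in enumerate(pairs) if find_positions(ja, ja_phrase)}
    if not zh_hits or not ja_hits:
        return 0.0
    return 2 * len(zh_hits & ja_hits) / (len(zh_hits) + len(ja_hits))


if __name__ == "__main__":
    tokens = "a b c a b d a b c".split()
    print(adjacent_entropy(tokens, ["a", "b"], "left"))
    print(adjacent_entropy(tokens, ["a", "b"], "right"))

    pairs = [
        (["机器", "翻译", "系统"], ["機械", "翻訳", "システム"]),
        (["机器", "翻译"], ["機械", "翻訳"]),
        (["语料", "库"], ["コーパス"]),
    ]
    print(dice_similarity(pairs, ["机器", "翻译"], ["機械", "翻訳"]))
```

In a pipeline like the one the abstract outlines, candidates whose entropy falls below a threshold would be filtered out, and that threshold would be tuned to corpus size and domain rather than fixed.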