智能计算机与应用
Computer Study
2014
Issue 4
Pages 1-4, 8
5 pages in total
Open-domain Named Entity Recognition; Self-training; Training Corpus Combination
Named entity recognition is an important task in natural language processing and supports many downstream applications. This paper focuses on boundary identification for Chinese open-domain named entities. Because training data for this task is scarce and manual annotation is costly, the paper first automatically annotates a large-scale Chinese proper noun corpus using bilingual parallel corpora and an English syntactic parser, and derives a noun compound corpus from a Chinese dependency treebank. It then uses a self-training approach to combine the two corpora into a named entity boundary identification corpus while training the boundary identification model. Experiments show that the self-training approach improves both the precision and the recall of boundary identification.
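The abstract does not specify the model, the features, or the rule used to merge the two corpora, so the following is only a minimal sketch of the general self-training idea, not the paper's method. It assumes a toy unigram tagger with two tags ("I" for inside an entity, "O" for outside); the names `train`, `predict`, `relabel`, `self_train` and the 0.8 confidence threshold are hypothetical. Each corpus is relabelled with a model trained on the other, and the final model is trained on the merged result.

```python
from collections import Counter, defaultdict

# A sentence is a list of (token, tag) pairs; tag is "I" (inside an entity) or "O".
# All names and the threshold below are illustrative assumptions.

def train(corpus):
    """Count how often each token is tagged I vs. O (a toy unigram model)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for token, tag in sentence:
            counts[token][tag] += 1
    return counts

def predict(model, token, threshold=0.8):
    """Return "I" only when the token was seen and is confidently inside."""
    seen = model.get(token)
    if not seen:
        return "O"
    return "I" if seen["I"] / sum(seen.values()) >= threshold else "O"

def relabel(model, corpus):
    """Keep existing I tags; upgrade O to I where the model is confident."""
    return [[(tok, "I" if tag == "I" or predict(model, tok) == "I" else "O")
             for tok, tag in sentence] for sentence in corpus]

def self_train(proper_nouns, noun_compounds, rounds=3):
    """Combine two partially labelled corpora by repeated relabel-and-retrain."""
    a, b = proper_nouns, noun_compounds
    for _ in range(rounds):
        model_a, model_b = train(a), train(b)
        # Each corpus receives the labels the other corpus's model is sure of.
        a, b = relabel(model_b, a), relabel(model_a, b)
    merged = a + b
    return train(merged), merged

# Toy data: the first corpus marks only proper nouns, the second only noun compounds.
proper = [[("Beijing", "I"), ("de", "O"), ("weather", "O")]]
compounds = [[("Beijing", "O"), ("weather", "I"), ("forecast", "I")]]
model, merged = self_train(proper, compounds)
```

In this toy run each corpus fills in the spans that the other one left unlabelled, which is the intuition behind combining a proper noun corpus with a noun compound corpus. A realistic system would replace the unigram counts with a sequence labeller and use its confidence scores to decide which automatic labels to keep.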