中文信息学报
中文信息學報
중문신식학보
JOURNAL OF CHINESE INFORMATION PROCESSING
2010
Issue 2
pp. 24-32
9 pages
唐旭日%陈小荷%许超%李斌
唐旭日%陳小荷%許超%李斌
당욱일%진소하%허초%리빈
计算机应用%中文信息处理%篇章地名关系%条件随机场%地名性判断
計算機應用%中文信息處理%篇章地名關係%條件隨機場%地名性判斷
계산궤응용%중문신식처리%편장지명관계%조건수궤장%지명성판단
computer application%Chinese information processing%discourse-based location name relation%conditional random fields%toponymhood calculation
该文介绍了以篇章为单位的中文地名识别方法和系统实现.地名识别包括简单地名识别和复杂地名识别两个阶段.简单地名识别由基于条件随机场的识别模块和基于篇章地名关系的识别模块顺序构成,以原始文本为输入,直接利用地名内部结构和相邻字信息进行地名识别和文本分词,然后利用篇章地名关系和地名性判断进一步处理.复杂地名识别以简单地名识别结果为输入,采用条件随机场识别.系统在封闭测试和开放测试中F-1值分别达到92.87%和89.76%.研究发现,在地名性判断中地名确信度低的字串对于地名识别干扰性较大,篇章地名关系能够在不降低识别精确度的情况下有效提高召回率,综合利用地名短距离和长距离依存关系可以有效提高地名识别效果.
該文介紹了以篇章為單位的中文地名識別方法和系統實現.地名識別包括簡單地名識別和複雜地名識別兩個階段.簡單地名識別由基於條件隨機場的識別模塊和基於篇章地名關係的識別模塊順序構成,以原始文本為輸入,直接利用地名內部結構和相鄰字信息進行地名識別和文本分詞,然後利用篇章地名關係和地名性判斷進一步處理.複雜地名識別以簡單地名識別結果為輸入,採用條件隨機場識別.系統在封閉測試和開放測試中F-1值分別達到92.87%和89.76%.研究發現,在地名性判斷中地名確信度低的字串對於地名識別干擾性較大,篇章地名關係能夠在不降低識別精確度的情況下有效提高召回率,綜合利用地名短距離和長距離依存關係可以有效提高地名識別效果.
해문개소료이편장위단위적중문지명식별방법화계통실현.지명식별포괄간단지명식별화복잡지명식별량개계단.간단지명식별유기우조건수궤장적식별모괴화기우편장지명관계적식별모괴순서구성,이원시문본위수입,직접이용지명내부결구화상린자신식진행지명식별화문본분사,연후이용편장지명관계화지명성판단진일보처리.복잡지명식별이간단지명식별결과위수입,채용조건수궤장식별.계통재봉폐측시화개방측시중F-1치분별체도92.87%화89.76%.연구발현,재지명성판단중지명학신도저적자천대우지명식별간우성교대,편장지명관계능구재불강저식별정학도적정황하유효제고소회솔,종합이용지명단거리화장거리의존관계가이유효제고지명식별효과.
This paper presents a discourse-level system for recognizing Chinese location names. Recognition proceeds in two stages, simple and complex location name recognition, carried out by three modules applied in sequence: a CRF-based module for simple location names, a discourse-based module that exploits relations among the location names in a document, and a CRF-based module for complex location names. The first module takes raw text as input and models both the internal structure of location names and neighboring-character information, performing location name recognition and word segmentation together. The discourse-based module then refines this output using discourse-based location name relations and toponymhood calculation. The complex location name module is also CRF-based but operates on the output of simple location name recognition. In experiments, the system achieves F1 scores of 92.87% and 89.76% in closed and open tests, respectively. The study also finds that strings with low toponymhood confidence interfere substantially with recognition, that discourse-based location name relations raise recall without lowering precision, and that combining short-distance and long-distance dependencies among location names improves recognition.
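The discourse-based step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `propagate_discourse_names`, the `(start, end, confidence)` span format, and the `min_confidence` threshold are all assumptions made for illustration, and the first-pass confidence is used only as a rough stand-in for a toponymhood score. The idea shown is that a name recognized with high confidence somewhere in a document is also marked at its other occurrences in that document, while low-confidence strings are not propagated.

```python
def propagate_discourse_names(text, first_pass, min_confidence=0.8):
    """Add document-wide occurrences of confidently recognized names.

    text: the full document as a string.
    first_pass: list of (start, end, confidence) spans from a first-pass
        recognizer (hypothetical format; end is exclusive).
    Returns a sorted list of (start, end) spans.
    """
    taken = [False] * len(text)   # characters already covered by a span
    result = []
    best = {}                     # best first-pass confidence per surface string

    # Keep every first-pass span as-is and record per-name confidence.
    for start, end, conf in first_pass:
        result.append((start, end))
        for i in range(start, end):
            taken[i] = True
        name = text[start:end]
        if name:
            best[name] = max(best.get(name, 0.0), conf)

    # Only names above the (assumed) threshold are propagated.
    # Longer names go first so they win over names they contain.
    trusted = sorted(
        (n for n, c in best.items() if c >= min_confidence),
        key=len,
        reverse=True,
    )

    for name in trusted:
        pos = text.find(name)
        while pos != -1:
            end = pos + len(name)
            if not any(taken[pos:end]):   # skip positions already labeled
                result.append((pos, end))
                for i in range(pos, end):
                    taken[i] = True
            pos = text.find(name, pos + 1)

    return sorted(result)


text = "南京位于江苏。我去过南京。"
spans = propagate_discourse_names(text, [(0, 2, 0.95), (4, 6, 0.4)])
print([text[s:e] for s, e in spans])
```

In this example the second occurrence of 南京 is recovered from the confident first one, whereas the low-confidence 江苏 span is kept but never propagated, which mirrors the abstract's observation that low-confidence strings are a source of interference.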