计算机工程
COMPUTER ENGINEERING
2015
Issue 2
145-150
, 6 pages in total
张志昌%陈松毅%刘鑫%马慧芳
上下位关系%语境相似度%布朗聚类相似度%点互信息%模式匹配%聚类验证
hyponymy relation%context similarity%Brown clustering similarity%Pointwise Mutual Information (PMI)%pattern matching%clustering validation
对海量文本语料进行上下位语义关系自动抽取是自然语言处理的重要内容,利用简单模式匹配方法抽取得到候选上下位关系后,对其进行验证过滤是难点问题。为此,分别通过对词汇语境相似度与布朗聚类相似度计算,提出一种结合语境相似度和布朗聚类相似度特征对候选下位词集合进行聚类的上下位关系验证方法。通过对少量已标注训练语料的语境相似度和布朗聚类相似度进行计算,得到验证模型和2种相似度的结合权重系数。该方法无需借助现有的词汇关系词典和知识库,可对上下位关系抽取结果进行有效过滤。在CCF NLP&CC2012词汇语义关系评测语料上进行实验,结果表明,与模式匹配和上下文比较等方法相比,该方法可使 F 值指标得到明显提升。
Hyponymy has many important applications in Natural Language Processing (NLP), and the automatic extraction of hyponymy relations from massive text corpora is an important NLP research task. The key difficulty is validating whether a candidate hyponym extracted by simple pattern matching is actually correct. By computing the context feature similarity (SimCF) and the Brown clustering similarity (SimBrown) between words, this paper proposes a novel hyponymy validation approach that clusters the candidate hyponyms using a similarity feature that combines SimCF and SimBrown. The combination coefficient of the two similarities is derived from the SimCF and SimBrown values computed between a small set of labeled training words and their hyponyms. The model can filter coarsely extracted results without relying on any existing lexical relation dictionary or knowledge base. Evaluation on the CCF NLP&CC2012 word semantic relation corpus shows that the proposed approach significantly improves the F-measure compared with other approaches, including pattern matching and simple context comparison.
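The abstract does not give the formulas, so the following is only a minimal sketch of how a combined similarity could be used to filter candidate hyponyms. Everything in it is an assumption made for illustration: SimCF is taken to be cosine similarity over bag-of-context-word counts, SimBrown is taken to be the normalised shared-prefix length of Brown cluster bit strings, the two are mixed linearly with a hypothetical weight `alpha`, and a simple mean-similarity threshold stands in for the clustering step described in the paper. The names `sim_cf`, `sim_brown`, `combined_sim`, and `filter_hyponyms` are hypothetical.

```python
import math
from collections import Counter


def sim_cf(ctx_a, ctx_b):
    """Cosine similarity between two context-word count vectors (assumed SimCF)."""
    dot = sum(ctx_a[w] * ctx_b[w] for w in ctx_a.keys() & ctx_b.keys())
    norm_a = math.sqrt(sum(v * v for v in ctx_a.values()))
    norm_b = math.sqrt(sum(v * v for v in ctx_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sim_brown(path_a, path_b):
    """Shared-prefix length of two Brown cluster bit strings,
    divided by the longer string's length (assumed SimBrown)."""
    if not path_a or not path_b:
        return 0.0
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return common / max(len(path_a), len(path_b))


def combined_sim(a, b, contexts, brown_paths, alpha=0.5):
    """Linear mix of the two similarities; alpha is a hypothetical weight
    that the paper would instead estimate from labeled training data."""
    return (alpha * sim_cf(contexts[a], contexts[b])
            + (1 - alpha) * sim_brown(brown_paths[a], brown_paths[b]))


def filter_hyponyms(candidates, contexts, brown_paths, alpha=0.5, threshold=0.3):
    """Keep candidates whose mean combined similarity to the other candidates
    reaches the threshold (a simple stand-in for the clustering step)."""
    kept = []
    for c in candidates:
        others = [o for o in candidates if o != c]
        if not others:
            kept.append(c)
            continue
        mean = sum(combined_sim(c, o, contexts, brown_paths, alpha)
                   for o in others) / len(others)
        if mean >= threshold:
            kept.append(c)
    return kept
```

With toy data for a hypernym such as "fruit", candidates that share contexts and Brown cluster prefixes (for example "apple" and "pear") support each other, while an outlier extracted by a noisy pattern (for example "table") falls below the threshold and is dropped.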