山东大学学报(理学版)
JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE)
2015
Issue 1
pp. 26-30
5 pages in total
inconsistencies%potential error%Chinese treebank%natural language processing
Corpora are fundamental to natural language processing (NLP), and their annotation quality affects the performance of NLP systems built on supervised machine learning. For Chinese syntactic treebanks, this paper proposes a method that finds potential annotation errors by detecting inconsistencies. Inconsistencies are detected in two ways: the first compares phrases with similar internal structure and weighs them by a suspicion degree; the second starts from the annotation guideline and checks for part-of-speech, phrase, and other tags that do not conform to the guideline's definitions. Experimental results show that a number of the detected inconsistencies are genuine corpus annotation errors.
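The two detection strategies named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record does not give the paper's definition of the suspicion degree, so the score used here (one minus the relative frequency of a minority label) is a stand-in, and the toy phrases, the tag names, and the `GUIDELINE_LABELS` set are hypothetical rather than taken from the treebank the authors used.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (phrase label, tuple of child tags) pairs.
PHRASES = [
    ("NP", ("JJ", "NN")),
    ("NP", ("JJ", "NN")),
    ("NP", ("JJ", "NN")),
    ("VP", ("JJ", "NN")),  # minority label for the same internal structure
    ("VP", ("VV", "NN")),
    ("VP", ("VV", "NN")),
    ("XP", ("VV", "NN")),  # label that the assumed guideline does not define
]

# Hypothetical guideline inventory of phrase labels.
GUIDELINE_LABELS = {"NP", "VP", "PP"}


def structure_inconsistencies(phrases):
    """Group phrases by internal structure and score minority labels."""
    by_structure = defaultdict(Counter)
    for label, children in phrases:
        by_structure[children][label] += 1

    findings = []
    for children, counts in by_structure.items():
        if len(counts) < 2:
            continue  # this structure is labelled consistently
        total = sum(counts.values())
        majority_label, _ = counts.most_common(1)[0]
        for label, n in counts.items():
            if label != majority_label:
                # Assumed suspicion score, not the paper's definition.
                findings.append((children, label, 1 - n / total))
    # Most suspicious candidates first.
    return sorted(findings, key=lambda f: -f[2])


def guideline_violations(phrases, allowed):
    """Return labels that occur in the data but not in the guideline."""
    return sorted({label for label, _ in phrases if label not in allowed})


if __name__ == "__main__":
    for children, label, score in structure_inconsistencies(PHRASES):
        print(children, label, round(score, 2))
    print(guideline_violations(PHRASES, GUIDELINE_LABELS))
```

Candidates flagged this way are only potential errors; as the abstract notes, some inconsistencies turn out to be annotation errors, so each one still needs manual review.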