计算机研究与发展
JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT
2010
Issue 2
pp. 336-343
8 pages
Chinese base chunk%Chinese functional chunk%conditional random fields%syntactic parsing%sequence labeling
In Chinese chunking, the words of a sentence are first combined into base chunks, the base chunks are then combined into functional chunks, and the result is a hierarchical description of the sentence's syntactic structure. This paper treats the automatic labeling of Chinese functional chunks as a sequence labeling task and builds separate labeling models that use words and base chunks, respectively, as the labeling units. For each labeling model, a set of features at the base-chunk level is constructed, and a conditional random field (CRF) model is used to label the functional chunks. The experimental data come from the Tsinghua Chinese Treebank (TCT) corpus and are split into a training set and a test set at a ratio of 8:2. The results show that, compared with a labeling model that uses only word-level information, adding suitable base-chunk features significantly improves functional chunk labeling performance. With manually annotated base chunks, functional chunk labeling reaches a precision of 88.47%, a recall of 89.93%, and an F-measure of 89.19%. With automatically labeled base chunks, it reaches a precision of 84.27%, a recall of 85.57%, and an F-measure of 84.92%.
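The sequence labeling formulation and the reported metrics can be illustrated with a minimal Python sketch. It is not the paper's implementation: the B-/I-/O tag scheme, the chunk labels (`S`, `D`, `P`, `O`), the toy sentence, and all function names are assumptions made for illustration, and no CRF is trained here. The sketch only shows how functional chunks over a sequence of base-chunk units could be encoded as one tag per unit, decoded back into chunks, and scored with chunk-level precision, recall, and F-measure, taking the F-measure to be the harmonic mean of precision and recall.

```python
from typing import List, Tuple

# A chunk is (start, end_exclusive, label) over a sequence of labeling units.
Span = Tuple[int, int, str]


def spans_to_bio(n_units: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping chunks as one B-/I-/O tag per labeling unit."""
    tags = ["O"] * n_units
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode a B-/I-/O tag sequence back into chunks."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, label = i, tag[2:]  # tolerate an I- tag with no B- before it
    return spans


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (works on any common scale)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(gold: List[Span], pred: List[Span]) -> Tuple[float, float, float]:
    """Chunk-level scores: a predicted chunk counts only on an exact match."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    p = correct / len(pred_set) if pred_set else 0.0
    r = correct / len(gold_set) if gold_set else 0.0
    return p, r, f_measure(p, r)


# Hypothetical sentence of 5 base-chunk units with 4 gold functional chunks.
gold = [(0, 1, "S"), (1, 2, "D"), (2, 3, "P"), (3, 5, "O")]
gold_tags = spans_to_bio(5, gold)

# Hypothetical model output that wrongly splits the last chunk in two.
pred_tags = ["B-S", "B-D", "B-P", "B-O", "B-O"]
p, r, f = prf(gold, bio_to_spans(pred_tags))
print(gold_tags)
print(f"precision={p:.2f} recall={r:.2f} F={f:.4f}")

# Under the harmonic-mean assumption, the reported P/R pairs give these F values.
print(round(f_measure(88.47, 89.93), 2), round(f_measure(84.27, 85.57), 2))
```

Under this assumption, the precision and recall pairs reported above reproduce the stated F-measures of 89.19% and 84.92%. In a real system the tag sequence would come from a trained CRF whose feature functions look at the words and base chunks around each labeling unit.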