山东大学学报(工学版)
Journal of Shandong University (Engineering Science)
2014
Issue 1
pp. 13-18, 23
7 pages in total
于江德%赵红丹%郑勃举%余正涛
中文人名%性别判定%朴素贝叶斯分类%用字特征%特征组合%区分特征
Chinese names%gender discrimination%naïve Bayes classification%character feature%feature combination%distinguishing feature
基于中文人名用字具有的较强的性别区分性,提出一种利用朴素贝叶斯分类器对中文人名性别进行判定的方法,该方法将每个中文人名中的第一个字(字1)、第二个字(字2)、第一和第二个字组合(字1字2)作为区分特征,利用朴素贝叶斯分类方法对该人名所属性别进行判定。在412775个中文人名语料上采用10重交叉验证法进行训练和测试,对比了依据不同区分特征组合进行性别判定的准确率,分别采用字1,字2,字1+字2,字1+字1字2,字2+字1字2,字1+字2+字1字2(全部区分特征)构成的特征组合进行性别判定,平均判定准确率分别为72.75%,86.92%,88.84%,87.37%,89.35%,90.06%,取得的最好平均判定准确率为90.06%。
Because the characters used in Chinese names are strongly gender-discriminative, a method is proposed for determining the gender of a Chinese name with a naïve Bayes classifier. The method takes the first character of each name (Zi1), the second character (Zi2), and the combination of the first and second characters (Zi1Zi2) as distinguishing features, and applies naïve Bayes classification to determine the gender of the name. Training and testing were carried out on a corpus of 412,775 Chinese names using 10-fold cross-validation, and gender-determination accuracy was compared across six feature combinations: Zi1, Zi2, Zi1+Zi2, Zi1+Zi1Zi2, Zi2+Zi1Zi2, and Zi1+Zi2+Zi1Zi2 (all distinguishing features). The average accuracies were 72.75%, 86.92%, 88.84%, 87.37%, 89.35%, and 90.06%, respectively; the best average accuracy, 90.06%, was obtained with all distinguishing features.
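The feature scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `NaiveBayesGender`, the helper `extract_features`, the add-one (Laplace) smoothing, the treatment of each name as a surname-free string of one or two characters, and the eight-name toy training list are all assumptions made for illustration. The abstract does not specify smoothing or preprocessing, and the toy names are not drawn from the 412,775-name corpus.

```python
import math
from collections import Counter, defaultdict

SLOTS = ("zi1", "zi2", "zi1zi2")  # feature slots named after the abstract's Zi1, Zi2, Zi1Zi2


def extract_features(name, use=SLOTS):
    """Return {slot: value} for the selected feature slots of one name."""
    zi1 = name[0]
    zi2 = name[1] if len(name) > 1 else ""  # assumption: one-character names get an empty Zi2
    values = {"zi1": zi1, "zi2": zi2, "zi1zi2": zi1 + zi2}
    return {slot: values[slot] for slot in use}


class NaiveBayesGender:
    """Categorical naive Bayes over character features (illustrative only)."""

    def __init__(self, use=SLOTS, alpha=1.0):
        self.use = use      # which feature combination to use
        self.alpha = alpha  # Laplace smoothing constant (an assumption)

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.counts = defaultdict(Counter)  # (label, slot) -> counts of each value
        self.vocab = defaultdict(set)       # slot -> values seen in training
        for name, label in zip(names, labels):
            for slot, value in extract_features(name, self.use).items():
                self.counts[(label, slot)][value] += 1
                self.vocab[slot].add(value)
        return self

    def log_scores(self, name):
        """Log prior plus summed log likelihoods for each gender label."""
        total = sum(self.class_counts.values())
        feats = extract_features(name, self.use)
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for slot, value in feats.items():
                v = len(self.vocab[slot]) + 1  # one extra bucket for unseen values
                seen = self.counts[(label, slot)][value]
                score += math.log((seen + self.alpha) / (n + self.alpha * v))
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
train = [("伟强", "M"), ("建国", "M"), ("志明", "M"), ("国强", "M"),
         ("秀英", "F"), ("丽娟", "F"), ("美玲", "F"), ("秀兰", "F")]
names, labels = zip(*train)

model = NaiveBayesGender().fit(names, labels)           # Zi1+Zi2+Zi1Zi2
zi1_only = NaiveBayesGender(use=("zi1",)).fit(names, labels)  # Zi1 alone
print(model.predict("秀美"), model.predict("国庆"))
```

Passing a different `use` tuple selects one of the six feature combinations compared in the abstract. Reproducing the reported accuracies would additionally require the original corpus and a 10-fold cross-validation loop, neither of which is shown here.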