软件工程师
Software Engineer
2015
Issue 12
pp. 64-68
5 pages
stop words%candidate segmentation%confidence%new word extraction
This paper presents a minimal method for extracting new words that requires neither a corpus nor a complex mathematical model. The method scans the text stream of a document, removes stop characters and stop words, splits the text into unit clauses, enumerates the possible candidate terms in each clause, counts the frequency of each candidate, and computes a confidence value for every pair of candidates in which the shorter one is contained in the longer one. Short candidates are eliminated solely on the basis of confidence values above 90%, which yields the candidate keywords; these are then filtered against an existing lexicon, and what remains are the new words. The method can serve as an auxiliary tool for information processing.
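The pipeline described in the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation. The abstract does not define the confidence value, so the code assumes it is freq(long) / freq(short), the share of the short candidate's occurrences that fall inside the longer one. The stop-word list, the maximum candidate length `MAX_LEN`, the minimum frequency `MIN_FREQ`, and all function names are likewise hypothetical; only the 90% threshold comes from the abstract.

```python
import re
from collections import Counter

# Hypothetical defaults; the abstract fixes only the 90% threshold.
STOP_WORDS = {"的", "了", "和", "是", "在"}
MAX_LEN = 4        # assumed upper bound on candidate length, in characters
MIN_FREQ = 2       # assumed minimum frequency for a candidate to be kept
THRESHOLD = 0.9    # from the abstract: values above 90% eliminate the short word


def split_clauses(text, stop_words=STOP_WORDS):
    """Cut the text at stop words, punctuation, and whitespace."""
    # Replace each stop word with a space so it acts as a clause boundary.
    for w in sorted(stop_words, key=len, reverse=True):
        text = text.replace(w, " ")
    # Split on runs of non-word characters and drop empty pieces.
    return [c for c in re.split(r"[\W_]+", text) if c]


def count_candidates(clauses, max_len=MAX_LEN):
    """Enumerate every substring of length 2..max_len in each clause and count it."""
    counts = Counter()
    for clause in clauses:
        for n in range(2, max_len + 1):
            for i in range(len(clause) - n + 1):
                counts[clause[i:i + n]] += 1
    return counts


def prune_short(counts, threshold=THRESHOLD, min_freq=MIN_FREQ):
    """Drop a short candidate when a longer candidate containing it is nearly as frequent."""
    frequent = {w: f for w, f in counts.items() if f >= min_freq}
    dropped = set()
    for long_w, long_f in frequent.items():
        for short_w, short_f in frequent.items():
            if len(short_w) < len(long_w) and short_w in long_w:
                # Assumed confidence: freq(long) / freq(short).
                if long_f / short_f > threshold:
                    dropped.add(short_w)
    return {w: f for w, f in frequent.items() if w not in dropped}


def extract_new_words(text, lexicon):
    """Run the full pipeline and keep only keywords absent from the lexicon."""
    counts = count_candidates(split_clauses(text))
    keywords = prune_short(counts)
    return {w: f for w, f in keywords.items() if w not in lexicon}


text = "区块链技术很新。区块链应用很多。我们研究区块链。"
print(extract_new_words(text, lexicon={"技术", "应用"}))  # {'区块链': 3}
```

In this toy input the fragments 区块 and 块链 each occur three times, always inside 区块链, so their assumed confidence is 1.0 and both are dropped in favor of the longer word. The pairwise comparison is quadratic in the number of frequent candidates, which is acceptable for a single document but would need indexing for larger inputs.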