计算机工程
COMPUTER ENGINEERING
2015
Issue 5
pp. 202-206, 212
6 pages in total
马慧芳%贾美惠子%袁媛%张志昌
microblog%term correlation relationship%pair-wise constraints%semi-supervised clustering%non-negative matrix factorization
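To address the short, sparse, and high-dimensional nature of microblog text, an improved semi-supervised microblog clustering algorithm is proposed. The algorithm uses relationships between terms to enrich text features: it defines inter-document and intra-document term correlations to reveal the degree of semantic association between terms, and uses them to automatically generate labeled data that guides the clustering process. The prior information about terms is encoded as pairwise constraints, and a non-negative matrix tri-factorization model based on these term-level pairwise constraints is built to perform semi-supervised clustering of microblogs. Experimental results show that the algorithm reduces the tedious manual labeling process and clusters microblogs efficiently.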
A novel semi-supervised learning algorithm is presented that exploits the inner semantic information of microblog messages to compensate for their limited length. The key idea is to explore term correlation data, which captures semantic information for term weighting and provides greater context for short texts. Direct and indirect dependency weights between terms are defined to reveal the semantic correlation between terms. Must-link and cannot-link relations are encoded as constraints on terms. The microblog clustering problem is formulated as a semi-supervised non-negative matrix factorization co-clustering framework, which uses knowledge about features in the form of pair-wise constraints. Extensive experiments are conducted on two real-world microblog datasets. Experimental results demonstrate the effectiveness of the proposed algorithm: it not only greatly reduces the labor-intensive labeling process, but also exploits the hidden information in the microblog data itself.
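The abstract does not give the objective function or update rules, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes a generic non-negative matrix tri-factorization X ≈ F S Gᵀ fitted with standard multiplicative updates, and it swaps in a simple graph-style penalty on the term factor F to stand in for the must-link and cannot-link term constraints. The function name `constrained_nmtf`, the parameter `lam`, and the toy data are all hypothetical.

```python
import numpy as np

def constrained_nmtf(X, k_terms, k_docs, must=None, cannot=None,
                     lam=0.5, n_iter=200, seed=0, eps=1e-9):
    """Sketch: factor X (terms x docs) as F @ S @ G.T with non-negative factors.

    must / cannot are symmetric non-negative (terms x terms) matrices marking
    must-link and cannot-link term pairs. The penalty used here is an
    assumption, not taken from the paper:
        lam * (trace(F.T @ (D - must) @ F) + trace(F.T @ cannot @ F))
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = X.shape
    F = rng.random((n_terms, k_terms))   # term-cluster memberships
    S = rng.random((k_terms, k_docs))    # term-cluster / doc-cluster association
    G = rng.random((n_docs, k_docs))     # document-cluster memberships
    A = np.zeros((n_terms, n_terms)) if must is None else must
    C = np.zeros((n_terms, n_terms)) if cannot is None else cannot
    D = np.diag(A.sum(axis=1))           # degree matrix of the must-link graph

    errors = []
    for _ in range(n_iter):
        errors.append(np.linalg.norm(X - F @ S @ G.T))
        # Update F: attractive terms go in the numerator, repulsive in the denominator.
        SGt = S @ G.T
        F *= (X @ SGt.T + lam * (A @ F)) / (
            F @ (SGt @ SGt.T) + lam * (D @ F + C @ F) + eps)
        # Update G with F and S fixed.
        FS = F @ S
        G *= (X.T @ FS) / (G @ (FS.T @ FS) + eps)
        # Update S with F and G fixed.
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors

# Toy term-document matrix: 6 terms x 4 short messages, two rough topics.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1],
              [0, 0, 1, 1]], dtype=float)

must = np.zeros((6, 6))
must[0, 1] = must[1, 0] = 1.0      # terms 0 and 1 should share a cluster
cannot = np.zeros((6, 6))
cannot[0, 3] = cannot[3, 0] = 1.0  # terms 0 and 3 should be separated

F, S, G, errors = constrained_nmtf(X, 2, 2, must=must, cannot=cannot)
doc_labels = G.argmax(axis=1)      # hard cluster assignment per message
print(doc_labels, round(errors[0], 3), round(errors[-1], 3))
```

In this sketch the constraint matrices are written by hand. In the paper they are described as being generated automatically from term correlation weights, a step the abstract does not specify and which is therefore not reproduced here.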