Affiliation: School of Economics and Management, Tongji University
Source: 《情报学报》 (Journal of the China Society for Scientific and Technical Information), 2013, No. 4, pp. 376-384 (9 pages)
Abstract: Using corpora at two granularities, sentences and paragraphs, this study applies statistical machine learning to examine experimentally the factors that may affect sentiment classification of Chinese online reviews. N-grams are taken as candidate features of the sentiment text; document frequency, the chi-square (χ²) statistic, and expected cross entropy are used to reduce feature dimensionality; feature vectors are built with Boolean weighting; and an SVM classifier performs the sentiment classification. The experiments show that corpus granularity strongly affects classification accuracy: accuracy at the sentence level and at the paragraph level differs by about 10%, with sentences classified more accurately than paragraphs. The dimensionality reduction methods also affect accuracy at both granularities, and each has its own strengths and weaknesses, so the method should be chosen according to the task at hand. The relative performance of Unigram and Bigram features depends on both corpus granularity and the dimensionality reduction method, so neither is consistently better.
Field: [Cultural Sciences]
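
The feature pipeline summarized in the abstract (N-gram candidates, chi-square selection, Boolean weighting) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes character-level n-grams, since the record does not say whether word segmentation was applied; the function names and the four toy reviews are hypothetical; and the SVM training step is left out to keep the sketch dependency-free.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the set of character n-grams in text (presence only)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def chi_square_scores(docs, labels, n):
    """Score each n-gram with the 2x2 chi-square statistic (label 1 vs. label 0)."""
    total = len(docs)
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = total - pos_total
    pos_df, neg_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        # Document frequency: each n-gram counts once per review.
        (pos_df if y == 1 else neg_df).update(char_ngrams(text, n))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]      # positive reviews containing the term
        b = neg_df[term]      # negative reviews containing the term
        c = pos_total - a     # positive reviews without the term
        d = neg_total - b     # negative reviews without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken by code point order."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def boolean_vector(text, vocab, n):
    """Boolean weighting: 1 if the term occurs in the text, else 0."""
    grams = char_ngrams(text, n)
    return [1 if term in grams else 0 for term in vocab]


# Hypothetical toy corpus (1 = positive, 0 = negative), for illustration only.
docs = ["质量很好", "服务很好", "质量很差", "服务很差"]
labels = [1, 1, 0, 0]

unigram_scores = chi_square_scores(docs, labels, 1)
vocab = select_top_k(unigram_scores, 2)
print(vocab, boolean_vector("物流很好", vocab, 1))
```

Passing `n=2` instead of `n=1` switches the same pipeline to Bigram features. The resulting Boolean vectors are what would then be fed to an SVM classifier.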