Authors: Zhang Shaoyang (张绍阳); Cao Jiabo (曹家波); Wang Zifan (王子凡); Qu Weidong (曲卫东)
Affiliation: School of Information Engineering, Chang'an University, Xi'an 710064
Source: Computer Engineering and Applications (《计算机工程与应用》), 2017, No. 18, pp. 95-101 (7 pages)
Abstract: Word-frequency statistical methods, typified by the traditional Vector Space Model (VSM), achieve low accuracy when computing the similarity of Chinese paragraphs. To address this, the paper proposes a method for computing the similarity between Chinese paragraphs based on weighted bipartite graph matching. The computation is split into two levels, paragraphs and sentences; a sentence is treated as a simple paragraph, so its similarity is also computed by bipartite graph matching. First, a sentence backbone-word extraction algorithm extracts the backbone (key) words of each sentence. These words serve as the vertices of a bipartite graph, the similarity between backbone words serves as the weight between vertices, and sentence similarity is computed from the resulting matching. Second, sentences serve as the vertices of a weighted bipartite graph, the similarity between sentences serves as the weight between vertices, and the similarity between paragraphs is computed in the same way. Experimental results show that, compared with VSM, the method is considerably more accurate, because it identifies synonyms accurately and automatically matches similar words that occur at different positions in the two paragraphs.
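The two-level scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the backbone words are taken as already extracted, `SYNONYMS` and `word_sim` are a hypothetical toy stand-in for a real word-similarity resource, the matching is found by a small exhaustive search rather than any particular algorithm from the paper, and dividing the matching weight by the larger vertex count is an assumed normalization that the abstract does not specify.

```python
from functools import lru_cache


def max_weight_matching(weights):
    """Total weight of a maximum-weight matching in a bipartite graph.

    weights[i][j] is the non-negative weight between left vertex i and
    right vertex j. Exhaustive search with memoization; suitable only
    for small graphs (exponential in the smaller side).
    """
    if not weights or not weights[0]:
        return 0.0
    # Keep the smaller side on the columns, which are tracked in a bitmask.
    if len(weights) < len(weights[0]):
        weights = [list(col) for col in zip(*weights)]
    n_rows, n_cols = len(weights), len(weights[0])

    @lru_cache(maxsize=None)
    def best(row, used):
        if row == n_rows:
            return 0.0
        total = best(row + 1, used)  # leave this row unmatched
        for col in range(n_cols):
            if not used & (1 << col):
                candidate = weights[row][col] + best(row + 1, used | (1 << col))
                total = max(total, candidate)
        return total

    return best(0, 0)


def match_similarity(left, right, sim):
    """Similarity of two vertex lists via weighted bipartite matching.

    Assumption: normalize by the larger list so the score lies in [0, 1].
    """
    if not left or not right:
        return 0.0
    weights = [[sim(a, b) for b in right] for a in left]
    return max_weight_matching(weights) / max(len(left), len(right))


# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"fast", "quick"})}


def word_sim(a, b):
    """Toy word similarity: 1.0 if identical, 0.8 if listed synonyms, else 0.0."""
    if a == b:
        return 1.0
    return 0.8 if frozenset({a, b}) in SYNONYMS else 0.0


def sentence_sim(s1, s2):
    """Sentence level: vertices are backbone words."""
    return match_similarity(s1, s2, word_sim)


def paragraph_sim(p1, p2):
    """Paragraph level: vertices are sentences, weights are sentence similarities."""
    return match_similarity(p1, p2, sentence_sim)


# Each sentence is a tuple of (already extracted) backbone words.
p1 = [("car", "fast"), ("road", "wet")]
p2 = [("wet", "road"), ("quick", "automobile")]
score = paragraph_sim(p1, p2)
```

In this toy input the two paragraphs contain the same ideas in a different order and with synonyms substituted, which a pure word-frequency comparison would score poorly; the matching pairs each sentence with its counterpart regardless of position. For larger inputs a polynomial-time assignment solver would be a more practical choice than the exhaustive search used here.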