Institution: Institute of Computing Technology, Chinese Academy of Sciences
Source: Journal of Chinese Information Processing (《中文信息学报》), 2007, No. 5, pp. 14-17, 30 (5 pages)
Abstract: This paper applies two statistical measures, the Coupling Degree of Double Characters (CDDC) and the Difference of t-test (DT), to resolve overlapping ambiguities in Chinese word segmentation. First, all possible overlapping ambiguities are identified with the segmentation dictionary. Then a simple linear combination of CDDC and DT is used to decide whether each ambiguous position should be cut. The experimental results show that combining CDDC with DT performs better than combining the Mutual Information of Double Characters with DT, which previous work had shown to be very effective for overlapping ambiguity resolution. A linear combination of CDDC and DT is therefore a simple and effective way to resolve overlapping ambiguities.
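Keywords: computer applications; Chinese information processing; Chinese word segmentation; coupling degree of double characters; difference of t-test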
Field: [Automation and Computer Technology]
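
The two-step procedure summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the formulas for CDDC or DT, so both are passed in as caller-supplied scoring functions. The function names, the `weight` parameter, the `max_len` limit, and the rule "cut at the candidate position with the lower combined score" are all assumptions made for illustration, and the toy scores are made up.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]                # half-open character span [start, end)
Scorer = Callable[[str, int], float]  # (sentence, boundary position) -> score


def find_words(sentence: str, dictionary: Set[str], max_len: int = 4) -> List[Span]:
    """Return spans of all multi-character dictionary words in the sentence."""
    spans = []
    for i in range(len(sentence)):
        for j in range(i + 2, min(len(sentence), i + max_len) + 1):
            if sentence[i:j] in dictionary:
                spans.append((i, j))
    return spans


def find_overlaps(sentence: str, dictionary: Set[str], max_len: int = 4) -> List[Tuple[Span, Span]]:
    """Step 1: pairs of dictionary words that cross each other (overlapping ambiguity)."""
    spans = find_words(sentence, dictionary, max_len)
    return [(a, b) for a in spans for b in spans if a[0] < b[0] < a[1] < b[1]]


def combined_score(sentence: str, pos: int, cddc: Scorer, dt: Scorer, weight: float = 1.0) -> float:
    """Linear combination of the two measures at the boundary before sentence[pos].

    Assumption: a higher score means the two neighboring characters bind more strongly.
    """
    return cddc(sentence, pos) + weight * dt(sentence, pos)


def resolve(sentence: str, overlap: Tuple[Span, Span], cddc: Scorer, dt: Scorer, weight: float = 1.0) -> int:
    """Step 2 (assumed rule): cut where the combined score is lower."""
    (_, end_first), (start_second, _) = overlap
    # Cutting at start_second keeps the second word whole;
    # cutting at end_first keeps the first word whole.
    s_second = combined_score(sentence, start_second, cddc, dt, weight)
    s_first = combined_score(sentence, end_first, cddc, dt, weight)
    return start_second if s_second < s_first else end_first


# Toy demonstration with made-up scores (not real corpus statistics).
sentence = "结合成"
dictionary = {"结合", "合成"}
toy_cddc = lambda s, pos: {1: 0.9, 2: 0.2}.get(pos, 0.0)
toy_dt = lambda s, pos: 0.0

overlaps = find_overlaps(sentence, dictionary)
cut = resolve(sentence, overlaps[0], toy_cddc, toy_dt)
print(overlaps, sentence[:cut], "/", sentence[cut:])
```

With these toy scores the boundary after 结合 has the weaker binding, so the sketch cuts there. In practice the two scorers would be estimated from a corpus using the definitions given in the paper.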