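Authors: 楼文高 (Lou Wengao); 熊聘 (Xiong Pin); 冯国珍 (Feng Guozhen); 于晓虹 (Yu Xiaohong)
Affiliations: School of Management, Shanghai Business School, Shanghai 200235; Department of Computer Engineering, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093
Source: 《数理统计与管理》 (Journal of Applied Statistics and Management), 2017, No. 5, pp. 783-801 (19 pages)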
Abstract: This paper discusses the characteristics of, and differences between, six objective functions that arose from different understandings of the basic idea behind projection pursuit clustering (PPC) modelling proposed by Friedman et al. It analyzes the differences and relationships between three normalization methods for preprocessing sample data, and explains the essence of four schemes for choosing the local density window (cutoff) radius R. Empirical study and theoretical analysis show that the objective function Q(a) = S_z * D_z is not only the most widely used but also the one that best reflects the basic idea of projection pursuit. The objective function Q(a) = S_z + D_z suffers from large values swamping small ones. The objective function Q(a) = 1/S_z + μ * D_z^* applies only to large samples with high similarity, and does not yield better results. The objective functions Q(a) = S_z * C * E and Q(a) = S_z * D_z * E add the information entropy of the weights or of the sample projection values, yet do not give better clustering. The objective function Q(a) = S_z is inconsistent with the basic PPC modelling idea. The normalization method has a significant effect on the modelling results: maximum-value normalization best preserves the original structure of the sample data, range normalization helps weaken the weight differences between indicators, and mean-removal normalization can weaken the influence of outliers.
The radius R of the local density window also has a significant effect on the modelling results. The small-R scheme (R ≤ 0.1 S_z) helps distinguish samples but works against clustering; because few samples fall inside the window, the optimization sometimes fails to reach the true global optimum. The large-R scheme (2m ≥ R ≥ r_max) is wrong in its premise, its derivation, and its results. The scheme R = (r_(i,j))_(k) is meaningful only in the special case where the maximum distance between samples within a cluster is smaller than the minimum distance between samples in different clusters, and so lacks generality. The scheme of taking a moderate R in the range r_max/5 ≤ R ≤ r_max/3 is reasonable, and is consistent with the idea of Friedman et al. for choosing a sensible value of R.
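
As a rough illustration of the objective function Q(a) = S_z * D_z discussed in the abstract, the sketch below evaluates it for a given projection direction. The abstract does not define S_z or D_z, so the definitions here are assumptions taken from the form commonly used in PPC work: S_z is the sample standard deviation of the projected values, and D_z is the local density, summing (R - r_ij) over all pairs whose projected distance r_ij is below R. The function names, the range-normalization helper, and the default R = r_max/4 (one value inside the r_max/5 to r_max/3 range) are hypothetical choices made only for this example.

```python
import numpy as np


def range_normalize(x):
    """Range (min-max) normalization per indicator column.

    Assumes every column has max > min; constant columns would divide by zero.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)


def projection_index(x, a, radius=None):
    """Return (Q, S_z, D_z) for data x (n samples by m indicators) and direction a.

    Assumed definitions (not given in the abstract):
      z_i  = sum_j a_j * x_ij            projected value of sample i
      S_z  = sample standard deviation of z
      r_ij = |z_i - z_j|
      D_z  = sum over i, j of (R - r_ij) where r_ij < R
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    a = a / np.linalg.norm(a)              # constrain a to unit length
    z = x @ a                              # one-dimensional projections
    s_z = z.std(ddof=1)                    # spread of the projected values
    r = np.abs(z[:, None] - z[None, :])    # pairwise projected distances
    if radius is None:
        radius = r.max() / 4.0             # illustrative moderate R
    d_z = np.sum((radius - r) * (r < radius))  # local density within R
    return s_z * d_z, s_z, d_z
```

Passing `radius=0.1 * s_z` instead would correspond to the small-R scheme, which makes it straightforward to compare how the choice of R changes D_z for the same direction a. In a full PPC model the direction a would be found by maximizing Q with an optimizer, which this sketch leaves out.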