Supervisor: 文贵华
Discipline: 081203
Degree conferred: Doctoral (PhD)
Author:
Institution: South China University of Technology (华南理工大学)
Abstract: The rapid development of information technology has brought human society into the era of big data, in which people face massive volumes of data growing at a geometric rate. How to extract useful knowledge from such massive data is one of the common challenges that researchers and engineers worldwide face now and will continue to face for a long time to come. Moreover, more and more data, such as digital images, speech, text, and gene expression microarrays, are high-dimensional, so dimensionality reduction (DR) has become an important tool for handling high-dimensional data and avoiding the "curse of dimensionality". Traditional DR methods can effectively learn the intrinsic structure of high-dimensional data that have a linear structure, but their linear nature prevents them from revealing the nonlinear structure of the data and hence from discovering the intrinsic low-dimensional manifold underlying high-dimensional data. Manifold learning provides an effective way to address this problem. However, in many practical machine learning and data mining tasks it is easy to obtain a large amount of unlabeled data but only a very small amount of labeled data, which is exactly what semi-supervised learning focuses on: how to learn useful knowledge from both labeled and unlabeled data so as to improve learning performance.

Although many existing semi-supervised algorithms have succeeded in practical applications, they still suffer from problems such as the choice of the neighborhood size and sensitivity to noisy, sparse, and imbalanced data. Focusing on graph construction and optimization, this thesis studies semi-supervised dimensionality reduction algorithms and verifies the effectiveness of the proposed algorithms on practical problems including face recognition and cancer classification.
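Before turning to the individual contributions, the following minimal sketch illustrates the common skeleton that graph-based semi-supervised dimensionality reduction methods build on: construct a neighborhood graph over the samples, form its Laplacian, and obtain a linear projection from a generalized eigenproblem. It is a generic illustration in the spirit of Locality Preserving Projections, not any of the algorithms proposed in this thesis; the function names, the heat-kernel weighting, and all parameter values are illustrative assumptions.

```python
# Generic graph-based linear dimensionality reduction skeleton (illustrative only).
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def knn_graph(X, k=5, sigma=1.0):
    """Symmetric k-NN adjacency with heat-kernel edge weights."""
    D = cdist(X, X)                      # pairwise Euclidean distances
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D[i])[1:k + 1]  # k nearest neighbors, skipping the point itself
        W[i, idx] = np.exp(-D[i, idx] ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)            # symmetrize

def graph_dr(X, k=5, sigma=1.0, dim=2):
    """Project X (n_samples x n_features) to `dim` dimensions via a graph Laplacian."""
    W = knn_graph(X, k, sigma)
    Dg = np.diag(W.sum(axis=1))
    L = Dg - W                           # unnormalized graph Laplacian
    A = X.T @ L @ X                      # minimize sum_ij w_ij ||a^T x_i - a^T x_j||^2
    B = X.T @ Dg @ X + 1e-6 * np.eye(X.shape[1])   # small ridge keeps B positive definite
    vals, vecs = eigh(A, B)              # generalized eigenproblem, eigenvalues ascending
    P = vecs[:, :dim]                    # smallest eigenvalues give the projection
    return X @ P, P

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(100, 20))
    Y, P = graph_dr(X, k=5, dim=2)
    print(Y.shape)  # (100, 2)
```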
More concretely, the main contributions include:

(1) A novel Semi-supervised Dimensionality Reduction algorithm based on Locally Estimated Error (LEESSDR) is proposed and applied to face recognition. It is well known that the graph plays an important role in semi-supervised learning. However, the topology of the neighborhood graph constructed by many existing methods is unstable: it is sensitive to the choice of the neighborhood parameter, and the edge weights of the neighborhood graph are set inaccurately. Since local models are trained only on the points related to a particular sample, local learning approaches often outperform global ones; their good performance indicates that the label of a point can be well estimated from its neighbors. Motivated by this, LEESSDR uses the Local Learning Projections (LLP) algorithm to set the edge weights of the neighborhood graph by minimizing the local estimation error, and thereby effectively preserves the pairwise (must-link and cannot-link) constraint information as well as the global and local structure of the low-dimensional manifold on which the data lie. Because LLP does not require the input space to be locally linear, for a nonlinear local space it maps the data into a feature space via kernel functions and computes the local estimation error there, which improves the robustness of the algorithm to its parameters. Experimental results on the Extended YaleB and CMU PIE face databases demonstrate that LEESSDR outperforms other semi-supervised dimensionality reduction algorithms in both classification accuracy and robustness.

(2) A Local and Global Preserving Semi-supervised Dimensionality Reduction algorithm based on Random Subspaces (RSLGSSDR) is presented. Constructing a faithful graph is the first and most important step in graph-based semi-supervised learning; however, the neighborhood topology constructed by most existing approaches is unstable in the presence of noise. RSLGSSDR combines random subspaces with semi-supervised dimensionality reduction: it first designs multiple diverse subgraphs in different random subspaces of the data set, then fuses these subgraphs into a mixture graph on which dimensionality reduction is performed, preserving the global geometric structure of the data as well as the local one. Experimental results on public data sets demonstrate that RSLGSSDR not only achieves higher recognition performance than competing methods but is also robust over a wide range of input parameter values.

(3) A Random Subspace-based Semi-supervised Dimensionality Reduction algorithm (RSSSDR) is proposed. Cancer classification is valuable for supporting clinical decisions, so precise classification is essential to the successful diagnosis and treatment of cancer. Semi-supervised dimensionality reduction approaches perform well on clean data sets, but the neighborhood topology constructed by most existing approaches is unstable in the presence of noise. RSSSDR combines random subspaces with semi-supervised dimensionality reduction: it first designs multiple diverse subgraphs in different random subspaces of the data set and fuses them into a mixture graph on which dimensionality reduction is performed. The edge weights of the neighborhood graph are then determined by minimizing the local reconstruction error, so that the global geometric structure of the cancer data is preserved along with its local structure. Experimental results on public cancer data sets demonstrate that RSSSDR achieves high classification accuracy and strong robustness.
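As an illustration of the random-subspace graph fusion shared by RSLGSSDR and RSSSDR above, the sketch below builds one k-NN adjacency per random feature subspace and averages them into a mixture graph that a subsequent dimensionality reduction step could consume. It reflects one reading of the abstract rather than the authors' implementation; the function names, the number of subspaces, and the subspace ratio are assumptions.

```python
# Random-subspace mixture graph (illustrative sketch, not the thesis code).
import numpy as np
from scipy.spatial.distance import cdist

def knn_adjacency(X, k=5):
    """0/1 symmetric k-NN adjacency on the rows of X."""
    D = cdist(X, X)
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(D[i])[1:k + 1]] = 1.0
    return np.maximum(W, W.T)

def mixture_graph(X, n_subspaces=10, subspace_ratio=0.5, k=5, seed=0):
    """Average the k-NN graphs built in random feature subspaces."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = max(1, int(subspace_ratio * d))
    W = np.zeros((n, n))
    for _ in range(n_subspaces):
        feats = rng.choice(d, size=m, replace=False)   # one random subspace
        W += knn_adjacency(X[:, feats], k)
    return W / n_subspaces   # mixture graph; feed this to the DR step

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(80, 30))
    W = mixture_graph(X)
    print(W.shape, W.max())
```

Averaging the subgraph adjacencies is the simplest fusion rule; a weighted or majority-vote fusion would fit the same skeleton.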
(4) Cognitive laws are introduced into semi-supervised dimensionality reduction for the first time, yielding the Perceptual Relativity-based Semi-Supervised Dimensionality Reduction (RSSDR) algorithm. Semi-supervised dimensionality reduction approaches perform well in many applications, but when dealing with sparse, noisy, and imbalanced data they cannot guarantee a faithful graph, which then degrades performance. Based on the cognitive law of relativity, RSSDR introduces a relative transformation that maps the original data space into a relative space, in which measuring the similarity between data points accords better with human intuition; this improves the distinguishability of the data and, under certain conditions, suppresses the influence of noise. The algorithm then sets the edge weights of the neighborhood graph by minimizing the local reconstruction error in the relative space, thereby preserving the global geometric structure of the data as well as its local structure. Experimental results on face, gene expression, UCI, and noisy data sets show that RSSDR achieves better classification accuracy and robustness than other semi-supervised dimensionality reduction algorithms.
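The sketch below illustrates, under one reading of the abstract rather than the authors' code, the two ingredients of RSSDR just described: a relative transformation that represents each sample by its distances to all samples (one plausible form of the "relative space"), and LLE-style edge weights obtained by minimizing the local reconstruction error of each sample from its neighbors. The function names and the regularization constant are assumptions.

```python
# Relative transformation + local-reconstruction-error edge weights (illustrative only).
import numpy as np
from scipy.spatial.distance import cdist

def relative_transform(X):
    """Represent each sample by the vector of its distances to every sample."""
    return cdist(X, X)          # row i is one possible 'relative' representation of x_i

def reconstruction_weights(X, k=5, reg=1e-3):
    """Edge weights minimizing ||x_i - sum_j w_ij x_j||^2 with sum_j w_ij = 1."""
    n = X.shape[0]
    D = cdist(X, X)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D[i])[1:k + 1]
        Z = X[idx] - X[i]                         # neighbors centered on x_i
        G = Z @ Z.T                               # local Gram matrix (k x k)
        G += reg * np.eye(k) * (np.trace(G) + 1e-12)   # regularize near-singular G
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx] = w / w.sum()                   # enforce the sum-to-one constraint
    return W

if __name__ == "__main__":
    X = np.random.default_rng(2).normal(size=(60, 10))
    R = relative_transform(X)                     # 60 x 60 relative-space representation
    W = reconstruction_weights(R, k=5)            # reconstruction weights in that space
    print(R.shape, W.shape)
```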
Keywords: machine learning; dimensionality reduction; semi-supervised learning; bioinformatics; face recognition; cancer classification
CLC number: [TP18]
Field: [Automation and Computer Technology]