
Research on Algorithms and Application of Multi-instance Learning (多示例学习算法及其应用研究)

Advisor: Deng Huifang (邓辉舫)

Discipline code: 081203

Degree awarded: Doctorate

Author:

Institution: South China University of Technology

Abstract: As our capacity to collect and store data grows and computing power advances rapidly, the demand for computer-based data analysis has become broader and more urgent, making machine learning increasingly important. Multi-instance learning (MIL) is a comparatively new machine learning method that has become a research focus in recent years. It differs from traditional supervised and unsupervised learning and from the more recently proposed semi-supervised learning, and is regarded as a new learning framework. In MIL, the training set consists of labeled bags, each containing a number of unlabeled instances. A bag is labeled positive if at least one of its instances is positive, and negative if all of its instances are negative. The goal is for the learning system to learn from the training bags and correctly predict the labels of new bags. Because its training samples have a hierarchical representation, MIL reflects the logical structure of some real-world problems better than the flat single-sample attribute-value representation, which gives it a distinct advantage in classifying coarsely labeled objects. It has been widely applied in areas such as drug activity prediction; image retrieval, classification, and annotation; text classification; protein family prediction; index-page and link recommendation; computer security; and computer-aided medical diagnosis.

Building on an analysis of the state of MIL research in China and abroad and of the problems that remain open, this thesis studies four issues: dependence on a single instance, construction of bag features, dimensionality reduction of bag features, and parallel algorithms. It proposes several MIL algorithms and applies them to image retrieval and classification. The main results are as follows:

1. Existing MIL algorithms for image retrieval depend on a single instance and are time-consuming. To address this, the thesis proposes an image retrieval method based on MIL and Bayesian classification (MIL-Bayesian). First, each image is segmented into regions; the image is treated as a bag and its regions as the instances in that bag. Second, the diverse density (DD) function value is computed for every region of every image, the likely positive regions are collected into a set, and a Gaussian mixture model is used to estimate the class-conditional probability density of positive regions. Third, a Bayesian classifier computes for each image the posterior probability of belonging to the positive class, and images are returned to the user ranked by that posterior in descending order. Finally, after several rounds of relevance feedback, the user obtains a satisfactory set of images. Experiments on the Corel image set show that the method achieves good retrieval precision and high retrieval efficiency.

2. To reduce the dependence of bag-feature construction on the features of a few instances, and to narrow the semantic gap between low-level image features and high-level concepts, the thesis combines the cluster distribution information obtained by density clustering with the ability of the MIL framework to distinguish ambiguous objects, and proposes an image classification method based on density clustering of region features and MIL (DCRF-MIL). First, each image is segmented into regions, all regions are pooled into one set, and a density clustering algorithm learns the cluster distribution of region features over that set. Second, each image is treated as a bag and its regions as instances; using the cluster distribution information, each bag is mapped to a vector in the cluster distribution space, which serves as the bag feature and carries semantic information about the image regions. Finally, a support vector machine is trained on the training set of bag features and used to classify test images. Experiments on the Corel image set and the MUSK molecular activity prediction data set show that DCRF-MIL achieves high classification accuracy and that its parameters are easy to choose.

3. To address the high dimensionality of bag features obtained by transforming the instance space, the thesis proposes an MIL algorithm based on an ensemble of multiple subspaces (MSEMIL) and a parallel implementation of it (P_MSEMIL). First, each bag is mapped onto the instance space formed by all instances to obtain a bag feature. Second, bagging-based selection of training-sample subsets is combined with random selection of feature subsets to partition the training and test sets into multiple subspaces, and a semi-supervised sub-classifier is trained on each subspace. Third, an ensemble strategy merges the outputs of the sub-classifiers into an MIL ensemble classifier. Finally, the ensemble classifier is parallelized on a cluster computing system using ProActive, a Java-based middleware for distributed parallel computing. Experiments on the MUSK and Corel data sets show that, compared with similar algorithms, MSEMIL achieves higher classification accuracy and is more robust to label noise, and that P_MSEMIL has low computation time and a high speedup ratio.
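The bag labeling rule and the idea of mapping a bag to a vector over clusters, both described in the abstract, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the cluster centers are hypothetical and taken as given (the thesis obtains cluster information by density clustering), nearest-center assignment and the normalized histogram are stand-ins chosen for brevity, and the names `bag_label` and `bag_to_cluster_vector` are invented for this example.

```python
from math import dist


def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive iff at least one
    of its instances is positive; otherwise it is negative."""
    return any(instance_labels)


def bag_to_cluster_vector(bag, centers):
    """Map a bag (a list of instance feature vectors) to a fixed-length
    vector: the fraction of its instances falling in each cluster.

    Assumption: each instance is assigned to its nearest cluster center.
    This stands in for the cluster information that the thesis derives
    from density clustering.
    """
    counts = [0] * len(centers)
    for instance in bag:
        nearest = min(range(len(centers)),
                      key=lambda k: dist(instance, centers[k]))
        counts[nearest] += 1
    total = len(bag)
    if total == 0:
        return [0.0] * len(centers)
    return [c / total for c in counts]


# Hypothetical toy data: three cluster centers and one bag of four
# two-dimensional region features.
centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
bag = [(0.5, 0.2), (9.0, 11.0), (0.1, -0.3), (1.0, 9.5)]
vector = bag_to_cluster_vector(bag, centers)
```

Because every bag becomes a vector of the same length regardless of how many instances it holds, an ordinary single-instance classifier such as a support vector machine could then be trained on these vectors.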

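Keywords: pattern classification; multi-instance learning; machine learning; image retrieval; image classification; parallel algorithms

Classification number: [TP181 TP391.41]

Field: [Automation and Computer Technology]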
