Advisor: 胡明涵
Discipline code: 081203
Degree conferred: Master's
Author:
Institution: 东北大学 (Northeastern University, China)
Abstract: Relation extraction between enterprise entities is a form of entity relation extraction and a typical information extraction problem. Driven by the MUC and ACE evaluations, research on entity relation extraction in China and abroad has made great progress in recent years, and researchers have proposed many effective methods. Among them, machine-learning methods, which recast relation extraction as a classification problem once the relation types have been defined, have shown very good performance. The flat feature vector method is one such fully supervised approach: it builds a feature vector from the words, parts of speech, entity types, and similar cues in the sentence context of an entity pair, constructs a vector space model, and then uses a classifier to identify the relation type. This thesis adopts this method as its first solution. Another fully supervised approach uses kernel features: it applies shallow syntactic parsing to the context fragment in which the entity pair occurs and constructs a kernel function to compute the similarity between two structured objects (such as syntax trees). This approach also achieves good results.
We first define six typical relation types according to the characteristics of relations between enterprises, build a keyword list for each relation type, and crawl the web to obtain a fairly large data set. After preprocessing, we manually annotate a small set of instances and randomly generate a test set. Using the annotated set as training data, we first build an enterprise entity relation extraction system with the fully supervised flat feature vector method. The flat features are words of four part-of-speech categories within a fixed-size window before and after each entity, and the classifiers are SVM and kNN.
Most existing methods rely on a large annotated corpus and obtain their extraction results through fully supervised learning. In practice, however, annotated corpora are usually scarce, while large amounts of unannotated text are easy to obtain. For this reason, this thesis also builds a template-based, semi-supervised enterprise entity relation extraction system. The system takes the annotated data as seeds and applies a mechanism for learning and evaluating templates, together with a mechanism for matching and evaluating instances, to expand the set of trusted instances. After several bootstrapping iterations, a high-quality template set is obtained and then used on the test instance set…
Keywords: entity relation extraction; machine learning; feature vector; semi-supervised learning; framework
Field: [Automation and Computer Technology]
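Illustration: the flat feature vector approach described in the abstract could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the part-of-speech tags in `KEPT_POS`, the window size, the relation labels `acquisition` and `supply`, the cosine similarity measure, and all function names are hypothetical, since the abstract does not name the four word categories, the six relation types, or the distance measure.

```python
from collections import Counter
import math

# Hypothetical POS filter: the abstract does not say which four categories were used.
KEPT_POS = {"n", "v", "a", "p"}

def window_features(tokens, e1, e2, window=2):
    """Bag-of-words features from the window before and after each entity.

    tokens: list of (word, pos) pairs; e1, e2: token indices of the two entities.
    """
    feats = Counter()
    for name, idx in (("e1", e1), ("e2", e2)):
        lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
        for j in range(lo, hi):
            word, pos = tokens[j]
            # Skip the entity mentions themselves and words outside the POS filter.
            if j not in (e1, e2) and pos in KEPT_POS:
                side = "before" if j < idx else "after"
                feats[f"{name}_{side}={word}"] += 1
    return feats

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors (assumed measure)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, feats, k=1):
    """train: list of (feature Counter, label); majority vote among the k most similar."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], feats), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy usage with made-up sentences and labels.
train = [
    (window_features([("A", "n"), ("acquired", "v"), ("B", "n")], 0, 2), "acquisition"),
    (window_features([("C", "n"), ("supplies", "v"), ("D", "n")], 0, 2), "supply"),
]
query = window_features([("X", "n"), ("acquired", "v"), ("Y", "n")], 0, 2)
predicted = knn_predict(train, query)
```

An SVM could take the place of `knn_predict` once the sparse features are mapped to fixed-length vectors; the feature extraction step would stay the same.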