机构地区: 中国科学院计算技术研究所
出 处: 《计算机研究与发展》 2012年第11期2344-2351,共8页
摘 要: 关键词抽取是从文本中抽取代表性关键词的过程,在文本处理领域中具有重要的应用价值.利用一种近年来受到广泛关注的新的信息源——社会化标签(tag)——来提高网页关键词抽取的质量.通过对Tag数据进行统计分析,发现用户往往对多个在话题上相关的网页使用同样的标签词,一个特定的文档可以通过其标注信息找到相关文档.在此基础上,提出了利用Tag进行关键词抽取的框架,并给出了一种具体的实现方法Tag-TextRank.该方法在TextRank基础上,通过目标文档中的每个Tag引入相关文档来估计词项图的边权重并计算得到词项的重要度,最后将不同Tag下的词项权重计算结果进行融合.在公开语料上的实验表明,Tag-TextRank在各项评价指标上均优于经典的关键词抽取方法TextRank,并具有很好的推广性. Keyword extraction is to extract representative keywords from texts and has been widely used in most text processing applications. In this paper, we explore the use of tags for improving the performance of webpage keyword extraction task. Specifically, we first analyze the characteristics of bookmarking behavior and find that people usually use the same tags to label multiple topic-related webpages, which is shown by the fact that over 90~ of labeled webpages can find relevant webpages through their tag information. Based on the discovery, we propose a method called Tag-TextRank. As an extension of the classic keyword extraction method TextRank, Tag-TextRank calculates the term importance based on a weighted term graph and the edge weight for a term pair is estimated by the statistics of the relevant documents which are introduced by a certain tag of the target webpage. The final importance score for a term is the combination of the above tag dependent importance scores. Tag-TextRank can measure the term relations by utilizing more documents so as to better estimate the term importance. Experimental results on a publicly available corpus show that Tag- TextRank outperforms TextRank on various metrics.
领 域: [自动化与计算机技术] [自动化与计算机技术]