Affiliation: School of Software Engineering, South China University of Technology
Source: Computer Engineering & Science (《计算机工程与科学》), 2012, No. 11, pp. 1-6 (6 pages)
Abstract: Traditional web crawlers serve general-purpose search engines built on keyword retrieval. They do not capture the category information of web pages, which causes efficiency and accuracy problems for text clustering and topic detection. This paper proposes web page categorization and extraction based on the hierarchical structure of sites: by building a virtual site hierarchy categorization tree and extracting the hierarchical structure of real sites, a hierarchy-oriented crawler is designed and implemented that categorizes each page as it is crawled. For sites that carry no categorization information, a page-title-based categorization technique is presented, comprising construction of a domain knowledge base and computation of word semantic similarity based on HowNet (《知网》). Experimental results show that the method achieves good classification performance.
Field: [Automation and Computer Technology]
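
The title-based categorization step mentioned in the abstract could be sketched roughly as follows. This is only an illustrative sketch, not the paper's algorithm: the function names, the scoring rule (average of each title word's best match against a category's keywords), the threshold, and the toy knowledge base are all assumptions. In particular, `char_dice_similarity` is a hypothetical stand-in for a HowNet-based word semantic similarity, which the abstract names but does not specify.

```python
from typing import Callable, Dict, List, Optional


def char_dice_similarity(a: str, b: str) -> float:
    """Placeholder word similarity (Dice coefficient over character sets).

    ASSUMPTION: stands in for a HowNet-based semantic similarity,
    whose details are not given in the abstract.
    """
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def classify_title(
    title_words: List[str],
    knowledge_base: Dict[str, List[str]],
    similarity: Callable[[str, str], float] = char_dice_similarity,
    threshold: float = 0.5,  # ASSUMPTION: arbitrary cut-off for "no category"
) -> Optional[str]:
    """Return the best-matching category for a segmented page title, or None."""
    if not title_words:
        return None
    best_category: Optional[str] = None
    best_score = 0.0
    for category, keywords in knowledge_base.items():
        if not keywords:
            continue
        # Score = mean over title words of the best similarity to any keyword.
        score = sum(
            max(similarity(word, kw) for kw in keywords) for word in title_words
        ) / len(title_words)
        if score > best_score:
            best_category, best_score = category, score
    return best_category if best_score >= threshold else None


# Toy domain knowledge base (hypothetical categories and keywords).
KNOWLEDGE_BASE = {
    "sports": ["football", "match", "league"],
    "finance": ["stock", "market", "bank"],
}
```

Calling `classify_title(["football", "league"], KNOWLEDGE_BASE)` selects the category whose keywords best match the title words. Passing the similarity function as a parameter keeps the sketch independent of any particular HowNet implementation.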