Affiliation: State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University
Source: Computer Science (《计算机科学》), 2002, No. 6, pp. 52-54 (3 pages)
Abstract: 1 Introduction. With the rapid worldwide growth of the Internet, the WWW (World Wide Web) has become a vast information service network that contains many kinds of information resources and whose sites span the globe. It offers users an extremely valuable source of information and has become one of the most important channels for sharing and disseminating information worldwide. Since its appearance, the WWW has grown explosively: both the number of WWW sites and the number of WWW users have been increasing exponentially, by a factor of 5 to 10 per year. At present, Internet users in China alone number 25 million. Information on the Web is expanding rapidly, but its quality varies greatly, which makes Web information retrieval and mining more difficult. Research is needed not only on the technology of information retrieval and Web mining itself; Web documents must also be cleaned before retrieval and mining are carried out. However, the latter is often neglected in most current research work. This paper puts forward the concept of Web document cleaning, and introduces the role that Web document cleaning plays in Web information processing as well as the process of Web document cleaning. A rule-based Web document cleaning system is implemented.
Keywords: information resources; information mining; document cleaning; computer networks; information retrieval
Subject areas: [Automation and Computer Technology] [Cultural Science]