Affiliation: College of Electrical and Information Engineering, Hunan University (湖南大学电气与信息工程学院)
Source: 《微计算机信息》 (Microcomputer Information), 2006, Issue 07X, pp. 203-205 (3 pages)
Abstract: Existing search engines return results that are too numerous and too disorganized. To address this, a Web text data mining technique for small texts, based on an approximate Web page clustering algorithm, is proposed. The technique represents text features with the vector space model and builds a vocabulary according to the user's degree of interest (users can initialize it as needed). It applies fuzzy clustering to that vocabulary to obtain a group of word-segmentation dictionaries, removes duplicate pages with the MD5 algorithm, clusters the remaining pages with the approximate page clustering algorithm, and orders the clustering results with a Markov-chain-based Web access sequence mining algorithm. The user is thus presented with a sequence of page clusters of interest and can quickly find the relevant pages. Experiments show that the algorithm greatly improves search efficiency while maintaining recall and precision. Because the mining targets small texts, the time and space complexity of the algorithms studied is low, so the approach is expected to become a practical and effective information retrieval technique.
Field: [Automation and Computer Technology]
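
The duplicate-removal step named in the abstract can be illustrated with a minimal sketch. The abstract states only that MD5 is used to remove repeated pages; the whitespace normalization, the `(url, text)` page representation, and the function names `normalize` and `dedupe_pages` are assumptions made here for illustration, not details taken from the paper.

```python
import hashlib


def normalize(text: str) -> str:
    # Assumption: collapse runs of whitespace so trivially different
    # copies of the same page produce the same digest.
    return " ".join(text.split())


def dedupe_pages(pages):
    """Keep the first page seen for each MD5 digest of its normalized text.

    `pages` is an iterable of (url, text) pairs (a hypothetical layout).
    """
    seen = set()
    unique = []
    for url, text in pages:
        digest = hashlib.md5(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique
```

An exact digest match only catches pages whose normalized text is identical; pages that are merely similar would be left for the later clustering step described in the abstract.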