InfoSigs: A Fine-Grained Clustering Algorithm for Web Objects

Authors: ; ; ; ; ;

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2010, No. 5, pp. 796-803 (8 pages)

Abstract: Fine-grained clustering of Web objects has become a hot topic in Web information retrieval (IR) research, since high-quality, large-scale Web IR requires fine-grained clustering of the objects in documents. However, most existing clustering models only cluster textual content or article topics. Because they ignore the token-level information that could identify objects at a finer level of detail, their results are coarse-grained. To address this problem, the authors propose InfoSigs, a fine-grained clustering algorithm that uses the information distribution of tokens inside Web documents as feature signatures. The work makes two contributions. First, it constructs a directed acyclic graph of information transmission from token frequency sequences, which imply a tree-like probabilistic hierarchy among tokens; each token is weighted by how concentrated its information distribution is in the graph, which yields more representative feature vectors. Second, it proposes a self-tuning record-merging model that raises the similarity among records within a record cluster and reduces the impact of noise on the merging process. Experiments on real datasets show that InfoSigs outperforms the conventional algorithms I-Match and Shingling by about 21.3% on average in F-Measure, and that it can be applied effectively to clustering Web objects across multiple domains.
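For orientation, here is a minimal sketch of the two standard ingredients the abstract compares against: Shingling-style resemblance between two texts, and the F-Measure used to score the results. It is not the InfoSigs algorithm, whose graph construction and weighting are not specified in this record. The function names, the whitespace tokenization, and the shingle width `w=3` are illustrative assumptions.

```python
def shingles(text, w=3):
    """Return the set of contiguous w-token shingles of a text.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard resemblance of the shingle sets of two texts."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


def f_measure(precision, recall):
    """Balanced F-Measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Two records would typically be placed in the same cluster when `resemblance` exceeds a chosen threshold; the threshold itself is a tuning parameter that this record does not state.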

Keywords: object; token frequency sequence; information distribution concentration; similarity histogram; record cluster

Field: [Automation and Computer Technology]
