帮助 本站公告
您现在所在的位置:网站首页 > 知识中心 > 文献详情
文献详细Journal detailed

Web信息网络社区挖掘的关键技术研究
Research on Key Technologies of Mining Communities in Web Information Network

导  师: 肖南峰

学科专业: 081203

授予学位: 博士

作  者: ;

机构地区: 华南理工大学

摘  要: 自由开放共享的Web 2.0使得数以亿计的Web用户成为互联网的主角。以博客网络、邮件网络与在线聊天室网络等为代表的Web信息网络已经深入到人们的生产生活中,这些各式各样的Web信息网络承载着人们在生产生活中形成的复杂结构模式。如何从这些纷繁芜杂的关系结构中挖掘出隐藏的潜在有价值的社区结构模式是一个极具挑战性的研究问题,这对于提升信息服务质量,增强公共信息安全以及推进复杂网络分析等多个方面都具有广泛的理论和应用价值。 本文以Web信息网络社区挖掘为核心,遵循从内容特征驱动的社区挖掘开始,然后到结构特征驱动的社区挖掘,最后到内容与结构相融合特征驱动的社区挖掘这样一条研究路线,对Web信息网络社区挖掘的关键技术模型与算法进行了深入细致的探讨。本文的主要工作成果与创新如下: 1/)针对基于传统向量空间模型的Web文档聚类挖掘算法会产生假相似的低质量Web文档社区结构,提出一个基于粒度理论与文章结构理论的集文档表示与文档聚类于一体的多粒度层次模型MHRM,在文档表示过程中引入段落级粒度知识来缩小文档级知识粒度与特征词级知识粒度之间跨度,在段落级聚类时设计了基于容差粗集与基于Ontology的两种可供选择的零相似处理方法以降低聚类对象零相似发生的机率,在文档级聚类时提出了段落级粒度知识对文档级粒度知识的主题贡献度度量方法,MHRM模型能有效挖掘真实Web文档集隐含的社区结构。 2/)从种群多样性的角度分别设计了基于优生理论的遗传算法EBSGA与基于民主领导的粒子群算法MLCPSO,仿真实验表明此两种算法具有良好的寻优能力。在此基础上,提出了EBSGA与MLCPSO相混合的优化算法EBSGA//MLCPSO,把遗传算法所具有的优越局部搜索能力与粒子群算法所具有的超强全局搜索能力进行有机结合,结合新闻组社区挖掘的具体场景,引入SVD技术寻找新闻组的潜在语义子空间。提出的3种算法在真实新闻组数据的社区挖掘实验表明: EBSGA算法、MLCPSO算法与EBSGA//MLCPSO算法都能较好地发现新闻组内在的社区结构,但是EBSGA//MLCPSO算法挖掘的新闻组社区结构的质量最高。 3/)设计了一种基于离散粒子群算法的非重叠社区挖掘算法CDPSO,该算法给出了一种基于邻居节点有序表的粒子编码方案,将非重叠社区的模块度值引入作为粒子适应度,改进了传统离散粒子群算法的粒子位置更新策略,并从理论上分析了粒子位置更新策略的收敛性,算法CDPSO能够在无先验信息的条件下快速有效地揭示网络内在的社区结构。在CDPSO的基础上引入线图的概念,给出了线图节点集合的一个划分对应于原图节点集合的一个覆盖的线图性质并加以理论证明,提出了重叠社区挖掘算法LGPSO,该算法把原图的重叠社区挖掘转变成对应线图的非重叠社区挖掘,能够有效地挖掘网络的重叠社区结构。 4/)从理论上分析了典型谱聚类算法的基本思路,指出了每种算法的优势和不足,并在Web社区发现的实验场景中进行了各种典型谱聚类算法的性能比较,将谱图理论与粗糙集理论相结合,提出了一种基于谱映射与粗糙聚类的重叠社区发现方法RSC,该算法用上下近似来刻画网络节点的社区归属,边界表示社区之间共享的节点,通过优化重叠社区结构模块度来实现重叠社区的有效挖掘。 5/)分析了在线社交网络的异构性与海量性,给出了在线社交网络及其挖掘的形式定义,考虑到社区定义的多样性与不同定义的社区有着不同的应用背景,在分析现有的启发式挖掘算法的基础上提出了一个具有良好开放性的广义的启发式挖掘框架。 6/)结合聊天数据的特点,提出了一个内容特征与结构特征相结合的聊天室社区挖掘算法,该算法一方面借助WordNet等语义计算工具对聊天数据的内容相似性进行研究,另一方面借鉴语言学知识来分析聊天数据的对话线程结构关联性,该算法能有效地挖掘出聊天者之间隐含的关系。 Billions of web users in the world have been becoming the unchallengeable main players in the free, open and sharing Web 2.0 cyberspace. Web information networks represented by blog network, email network, wiki network and etc have spread to practically all corners of human lives. Such miscellaneous web information networks reflect the complicated production and life structure of human being. It is an enormously challenging undertaking to discover the potentially valuable and hidden communities from the sophisticated relational structures, which eventually would contribute to enhancement of information service, public security and the analysis of complex networks. The result of such undertaking would have general values both in theory and practice. This dissertation focuses on mining communities in web information network, and further studies key technique models and algorithms in mining web information network communities in detail. It proceeds along such research line starting from content characteristics of web information carrier, then covering structure characteristics, and ending with content combined with structure characteristics. The primary contributions of the dissertation are as follow: 1. Based on granular theory and article structure theory, a multi-granular hierarchical representation model /(MHRM/) integrating document representation and clustering is proposed in order to avoid inferior web document communities that may result from web documents clustering algorithms based on traditional vector space model. In MHRM, we have introduced paragraph level granular knowledge to bridge the gap between document level granular knowledge and term level granular knowledge in web document representation, and designed two optional strategies to deal with zero similarities among clustering objects in paragraph clustering, i.e., tolerance-rough-set based strategy and ontology based strategy, and presented a measure to evaluate the contribution of the paragraph level granular knowledge to topic of the document level granular knowledge in document clusteirng. It turned out that MHRM can effectively discover the true communities hidden in web documents. 2. A genetic algorithm, EBSGA, which is founded on eugenics and a particle swarm algorithm known as MLCPSO, which is based on democratic leadership, are designed to improve search ability of natural computing algorithms from the viewpoint of population diversity. Our simulation results have shown that both algorithms possess powerful optimizing ability. In order to discover communities hidden in newsgroup network, we have introduced SVD technique to construct latent semantic subspace of newsgroup, and then presented a hybrid optimization algorithm which consists of EBSGA and MLCPSO. This hybrid algorithm is proven to be able to leverage powerful global search ability of PSO and local search ability of GA, and can effectively detect the communities hidden in newsgroups. And a hybrid optimization algorithm, EBSGA//MLCPSO, which consists of EBSGA and MLCPSO is presented, which can leverage powerful global search ability of PSO and local search ability of GA, in order to discover communities hidden in newsgroup network, we have introduced SVD technique to construct latent semantic subspace of newsgroup. Experimental results from real datasets indicate that above three algorithms can effectively detect the communities hidden in newsgroups, but EBSGA//MLCPSO performs best. 3. A discrete PSO algorithm to mine non-overlapping communities, CDPSO, is devised. In this algorithm we have introduced community modularity as fitness of particle, devised a neighbor-list-based particle encoding scheme and an improved discrete particle position update strategy, and analyzed the convergence of the strategy. CDPSO can quickly and effectively discover the intrinsic community structure in networks without any domain information. Then we introduced the concept of line graph, and further proved the property that nodes partition of line graph is equivalent to nodes cover of the corresponding origin graph. Based on CDPSO and the property, an overlapping communities mining algorithm, LGPSO, is proposed which can effectively discover overlapping communities. 4. After the basic idea of typical spectral clustering algorithms is analyzed, advantages and disadvantages of each algorithm are pointed out, and performance of the typical algorithms is compared in discovering web communities. And more, we have integrated spectral graph theory and rough set theory, and proposed the overlapping communities mining algorithm RSC, which can describe communities membership of network nodes with lower and upper approximation, describe the network nodes shared by different communities with boundary, and so achieve effective discovery of overlapping communities by optimizing modularity of overlapping communities. 5. Heterogeneity and massiveness of online social networks/(OLNs/)are analyzed, and OLNs and its mining task is formally defined. In consideration of variety of community definitions, and the fact that a community structure conforming to some given definition is applicable to different places and, further, based on the analysis of current heuristic mining approaches, we have presented a general heuristic framework for communities discovery, which has good openness and feasibility. 6. An effective algorithm to discover communities hidden in online chatting rooms based on content and thread structure of chatting text stream is proposed. The algorithm can effectively discover hidden relations of the involved chatters with the following two techniques: one is to semantically compute chatting text content similarity, and another is to analyze thread structure association of dialogs in text stream.

关 键 词: 信息网络 社区挖掘 聚类 自然计算 谱图理论

分 类 号: [TP393.09]

领  域: [自动化与计算机技术] [自动化与计算机技术]

相关作者

作者 汤俊
作者 洪明
作者 孙宗锋
作者 谷斌
作者 钟美华

相关机构对象

机构 中山大学
机构 华南农业大学
机构 华南理工大学
机构 中山大学资讯管理学院信息管理系
机构 华南师范大学

相关领域作者

作者 李文姬
作者 邵慧君
作者 杜松华
作者 周国林
作者 邢弘昊