Affiliation: School of Computer Science and Technology, University of Science and Technology of China
Source: Journal of University of Science and Technology of China (《中国科学技术大学学报》), 2007, Issue 9, pp. 1080-1087 (8 pages)
Abstract: Unlike traditional transaction datasets, gene expression datasets typically contain far more items than transactions. Many existing frequent closed pattern mining algorithms are based on item enumeration, and their running time grows exponentially with the average transaction length, which makes most of them impractical on such data. This paper proposes TPclose, a new algorithm for mining frequent closed patterns from gene expression datasets. TPclose stores the tidset (transaction-ID set) of each item in a TP-tree (tidset-prefix tree) and converts the problem of mining frequent closed patterns into one of mining frequent closed tidsets. It explores the transaction enumeration search space with a top-down, divide-and-conquer strategy, combined with efficient pruning and optimization techniques. Experiments on real-life gene expression datasets show that TPclose is generally faster than RERⅡ, an existing algorithm based on a bottom-up transaction search strategy, by up to two orders of magnitude or more.
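The correspondence between closed patterns and closed tidsets that the abstract relies on can be illustrated with a small sketch. This is not TPclose itself: it uses no TP-tree and none of the paper's pruning or optimization techniques, and the function name `closed_patterns_by_tidsets` and its parameters are hypothetical. It only shows a naive top-down transaction enumeration, assuming transactions are given as a list of Python sets and `min_support` is at least 1. Starting from the full tidset, it removes one transaction at a time, and the items shared by all transactions in a tidset always form a closed pattern.

```python
def closed_patterns_by_tidsets(transactions, min_support):
    """Naive top-down transaction enumeration (illustrative only, not TPclose).

    transactions: list of sets of items; a transaction's index is its tid.
    min_support:  minimum number of supporting transactions (assumed >= 1).
    Returns a dict mapping each frequent closed pattern (frozenset of items)
    to its closed tidset (frozenset of tids).
    """
    n = len(transactions)
    all_items = set().union(*transactions)

    def items_of(tidset):
        # Items common to every transaction in the tidset.
        common = set(all_items)
        for tid in tidset:
            common &= transactions[tid]
        return frozenset(common)

    def tids_of(itemset):
        # All transactions that contain the itemset (its closed tidset).
        return frozenset(t for t in range(n) if itemset <= transactions[t])

    results = {}
    seen = set()
    stack = [frozenset(range(n))]  # top-down: start from the full tidset
    while stack:
        tidset = stack.pop()
        # Skip tidsets already visited or too small to be frequent.
        if tidset in seen or len(tidset) < min_support:
            continue
        seen.add(tidset)
        pattern = items_of(tidset)
        if pattern:
            results[pattern] = tids_of(pattern)
        # Descend by dropping one transaction at a time.
        for tid in tidset:
            stack.append(tidset - {tid})
    return results
```

The sketch visits every tidset of size at least `min_support`, so its cost is exponential in the number of transactions rather than in the number of items. That trade-off is only attractive when transactions are few and items are many, which is the situation the abstract describes for gene expression data.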
Keywords: data mining; association rules; frequent closed patterns; gene expression data; top-down
Field: [Automation and Computer Technology]