中国科学院软件研究所机构知识库
Advanced  
ISCAS OpenIR  > 人机交互技术与智能信息处理实验室  > 学位论文
题名:
数据流上的频繁项集挖掘技术研究
作者: 李坤
答辩日期: 2009-01-18
授予单位: 中国科学院软件研究所
授予地点: 软件研究所
学位: 博士
关键词: 数据流 ; 滑动窗口 ; ε-近似 ; 频繁元素 ; 频繁项集 ; 前缀树 ; 乐观裁剪方法
其他题名: A Study on Frequent Itemsets Mining in Data Streams
摘要: 数据流是近年出现的一个新的应用类型,具有连续、无限、高速等特点。典型的数据流包括:无线传感器网络应用环境中由传感器传回的各种监测数据、股票交易所的股票价格信息、网络监测系统与道路交通监测系统的监测数据、电信部门的通话记录数据,以及网站的日志信息等。数据流的出现对传统的数据管理和挖掘技术提出了巨大的挑战。传统的数据挖掘技术往往对静态数据集合做多遍扫描,其时间和空间复杂度均较高,难以直接应用到数据流环境中。本文对数据流上的频繁项集挖掘问题做了深入研究,主要研究内容和创新性成果概述如下: 本文首先对频繁项集挖掘问题做了一个全面的综述。综述部分先对静态数据集上的频繁项集挖掘的概念、分类、经典算法等相关研究做全面的介绍,然后分析了在数据流上进行频繁项集挖掘面临的问题和挑战、以及研究现状。 针对数据流上的频繁元素挖掘问题,本文提出了一个简单而高效的算法,挖掘数据流滑动窗口上的频繁元素。算法既可以定期返回满足ε-近似要求的频繁元素,也可以响应用户在任意时间提交的请求,返回满足误差要求的结果。 针对数据流上的频繁项集挖掘问题,本文提出了BFI-Stream算法,挖掘数据流滑动窗口上的所有频繁项集,实时返回精确结果。该算法使用前缀树数据结构,并且在创建和更新过程中裁剪了一部分非频繁节点,因此算法的空间和时间复杂度都较低。 接着,本文针对现有的在数据流上挖掘频繁项集的算法存在维护过多非频繁项集而导致使用空间过大的问题,提出了一种乐观裁剪方法,大大降低了算法的空间复杂度。文中先对实际数据集分析了项集的频率分布情况,提出了乐观裁剪方法,裁剪大部分非频繁项集;实验结果表明乐观裁剪方法不仅大大降低了内存使用量,还提高了算法的更新效率。 再次,本文针对用户指定最小支持度和允许误差的近似查询,提出了在数据流滑动窗口上挖掘频繁项集的近似算法AFI-Stream,返回满足误差要求的结果。AFI-Stream仅仅维护频繁项集,不维护非频繁项集,因此能大大降低算法使用的内存。 为了满足在数据流上挖掘频繁项集研究的需要,本文设计并开发了一个数据流频繁项集挖掘原型系统StreamMiner,进行相关算法的评测和研究。
英文摘要: Data stream is a new emerging class of applications in recent years, which is often continuous, unbounded, high-speed and has a data distribution changing with time. Examples of such applications include financial analysis, network monitoring, sensor networks, telecommunication data management, and others. Mining in data streams incurs extra challenges. The traditional mining techniques usually require multiple scans on the static dataset, which result in high complexity of both time and space, so it is hard to use these traditional algorithms directly on data streams. Mining data streams requires fast, real-time processing in order to keep up with the high-speed data arrival and mining results must be returned within short response time. In this thesis, the design of several core technologies for mining frequent itemsets in data streams is investigated. First of all, the thesis gives a comprehensive overview of the frequent itemset mining algorithms. This overview first describes the concepts, classification and classical algorithms of mining frequent itemsets on static datasets. It also discusses the problems, challenges and current status of the frequent itemset mining in data streams. Second, the problem of mining frequent items in data streams is studied. A simple and efficient algorithm is proposed to find frequent items in sliding window data streams. It can return not only the deterministic ε-approximate results periodically, but also the approximate results for the random queries at any time. It is guaranteed that the errors of all the results are smaller than the user specified error bound. Third, the problem of mining frequent itemsets in data streams is studied. The thesis proposes the Bounded Frequent Itemsets stream (abbreviated as BFI-stream) algorithm, which uses a prefix-tree based structure, called BFI-tree, to maintain all accurate frequent itemsets from sliding windows over data streams. It can process the mining requests at any time and mining all frequent itemsets with accurate frequencies is just a traversal on the tree, so it has a real-time response to the requests. Next, the thesis considers the problem of huge space usage of the current algorithms. It first analyzes the distribution of itemsets' frequencies by real data sets and then proposes an optimistic pruning method to reduce the space complexity. The experimental results show that the space performance is improved greatly even when the user specified minimum support threshold is small. Furthermore, to answer queries with minimum support and error bound, the thesis proposes an approximate algorithm, called AFI-Stream, to mine frequent itemsets in sliding window streams. The AFI-Stream algorithm only maintains frequent itemsets, so the memory usage is reduced a lot. At last, a stream mining prototype, named StreamMiner, is developed to evaluate frequent itemset mining algorithms in data streams.
语种: 中文
内容类型: 学位论文
URI标识: http://ir.iscas.ac.cn/handle/311060/6402
Appears in Collections:人机交互技术与智能信息处理实验室_学位论文

Files in This Item:
File Name/ File Size Content Type Version Access License
10001_200518015029032李坤_paper.pdf(4513KB)----限制开放-- 联系获取全文

Recommended Citation:
李坤. 数据流上的频繁项集挖掘技术研究[D]. 软件研究所. 中国科学院软件研究所. 2009-01-18.
Service
Recommend this item
Sava as my favorate item
Show this item's statistics
Export Endnote File
Google Scholar
Similar articles in Google Scholar
[李坤]'s Articles
CSDL cross search
Similar articles in CSDL Cross Search
[李坤]‘s Articles
Related Copyright Policies
Null
Social Bookmarking
Add to CiteULike Add to Connotea Add to Del.icio.us Add to Digg Add to Reddit
所有评论 (0)
暂无评论
 
评注功能仅针对注册用户开放,请您登录
您对该条目有什么异议,请填写以下表单,管理员会尽快联系您。
内 容:
Email:  *
单位:
验证码:   刷新
您在IR的使用过程中有什么好的想法或者建议可以反馈给我们。
标 题:
 *
内 容:
Email:  *
验证码:   刷新

Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 

Valid XHTML 1.0!
Copyright © 2007-2017  中国科学院软件研究所 - Feedback
Powered by CSpace