中国科学院软件研究所机构知识库
Advanced  
ISCAS OpenIR  > 中科院软件所  > 中科院软件所
题名:
基于索引的压缩文本查找算法研究及其并行化实现
作者: 李开士
答辩日期: 2007-06-11
授予单位: 中国科学院软件研究所
授予地点: 软件研究所
学位: 博士
关键词: 压缩 ; 自索引 ; FM-index ; 分块 ; 并行
摘要: 摘要 压缩查询是近几年兴起的一种文本模式查找技术,它是通过查找压缩文本实现初始文本的查找。在最初的时候,压缩查询是在线的,也就是在压缩文本上直接执行模式子串的匹配操作。对于海量的文本,在线的压缩查询不能满足性能要求,于是将压缩和索引结合起来形成了压缩索引的文本查询。文本索引一般包含两种类型:单词索引和全文索引。单词索引只能检索文本中的单词,而全文索引可以检索任意字符串。自索引是一种完全的索引,它完全包含索引的文本数据。 FM-index是一种压缩的自索引结构,可以实现全文检索。FM-index在简单查询和索引空间占用方面达到了很好的性能,但是在建立索引时却要消耗很大的内存,并且很难处理复杂查询,于是我们提出了一种重叠分块的FM-index索引结构,并且使用MPI对重叠分块FM-index的建立和查询实现了并行化。我们通过实验发现重叠分块的FM-index很好的解决了建立时内存占用过多的问题,它占用的内存跟文本数据量大小没有关系,只跟分块大小有关;它相对于原来的FM-index在复杂查询方面性能也有所改进;同时它占用的存储空间跟原来FM-index的大致相等。重叠分块FM-index的并行化也取得了较好的加速比效率,具有较好的可扩展性。
英文摘要: Abstract Compressed query is a new approach to text pattern matching problems, which achieves its goal through querying on compressed text. At its early stage, compressed query is on-line, i.e., to directly scan on compressed text from the beginning to the end. However, for massive volume data, this is not enough. Thus index is introduced into compressed query. There’re two kinds of text index-one is word-based index and the other one is full-text index. Word-based index can be used to querying words in text, while full-text index can support any string matching in text. Self-index is an index that contains the full text information in the index. FM-index is a compressed self-index and also is a full-text index. It occupies small storage space and is very quick when dealing with simple query. But it needs a lot of memory to build the index and is difficult to support complex query. To solve those problems, we designed an overlapping blocked FM-index algorithm and parallelized it with MPI. Through experiments, we found that overlapping blocked FM-index algorithm realized our goals. During building period, its memory occupation doesn’t depend on the volume size of the text, but depends on the block size. Compared to the original FM-index algorithm, overlapping blocked FM-index algorithm consumes less time to handle complex query. The storage space of overlapping blocked FM-index algorithm is almost equal to its original version. The parallel version of overlapping blocked FM-index algorithm achieved acceptable speedup and well scalability.
语种: 中文
内容类型: 学位论文
URI标识: http://ir.iscas.ac.cn/handle/311060/6534
Appears in Collections:中科院软件所

Files in This Item:
File Name/ File Size Content Type Version Access License
10001_200328015004320李开士_paper.doc(559KB)----限制开放-- 联系获取全文

Recommended Citation:
李开士. 基于索引的压缩文本查找算法研究及其并行化实现[D]. 软件研究所. 中国科学院软件研究所. 2007-06-11.
Service
Recommend this item
Sava as my favorate item
Show this item's statistics
Export Endnote File
Google Scholar
Similar articles in Google Scholar
[李开士]'s Articles
CSDL cross search
Similar articles in CSDL Cross Search
[李开士]‘s Articles
Related Copyright Policies
Null
Social Bookmarking
Add to CiteULike Add to Connotea Add to Del.icio.us Add to Digg Add to Reddit
所有评论 (0)
暂无评论
 
评注功能仅针对注册用户开放,请您登录
您对该条目有什么异议,请填写以下表单,管理员会尽快联系您。
内 容:
Email:  *
单位:
验证码:   刷新
您在IR的使用过程中有什么好的想法或者建议可以反馈给我们。
标 题:
 *
内 容:
Email:  *
验证码:   刷新

Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 

Valid XHTML 1.0!
Copyright © 2007-2017  中国科学院软件研究所 - Feedback
Powered by CSpace