Institutional Repository of the Institute of Software, Chinese Academy of Sciences
Title: Research on Adaptive Compression Algorithms for Chinese Text (汉语文本自适应压缩算法研究)
Author: 王忠效
Defense Date: 2000
Major: Computer Software and Theory
Degree-Granting Institution: Institute of Software, Chinese Academy of Sciences
Degree-Granting Place: Institute of Software, Chinese Academy of Sciences
Degree: Doctorate
Keywords: data compression; text compression; Markov model; finite-context model; arithmetic coding; dictionary coding; statistical coding
Abstract: Chinese text compression reduces both the storage space and the transmission time of Chinese texts, and it also serves as a form of data encryption; it can therefore be widely applied in data communications, information retrieval, electronic publishing, and related fields. Studying adaptive compression algorithms for Chinese text is also an effective route to adaptive models of Chinese text, which are an important foundation for adaptive Chinese-processing systems such as word segmentation, Chinese character recognition, part-of-speech tagging, and automatic text proofreading. This dissertation surveys the research and development of data compression, especially text compression, and studies comprehensively several key problems in adaptive Chinese text compression. The main contributions are:
1. Since redundancy in symbol structure is the principal target of modern text compression techniques, the dissertation compares the distributions of LZ phrases in English and Chinese texts under the LZ77 dictionary-coding algorithm, and thereby fixes the research direction for Chinese text compression: (1) a Chinese text compression algorithm should currently be a statistical method based on finite-context models; (2) the order-0 context model is crucial; (3) combining a small static dictionary for dictionary coding is an important way to raise coding efficiency.
2. Starting from the characteristics of the large Chinese character set, the dissertation introduces the concepts of the dynamic alphabet, edit distance, and novel-symbol-weighted area, studies the basic properties of dynamic-alphabet models, and on this basis proposes a construction for a dynamic-alphabet order-0 model of Chinese text. Combined with arithmetic coding, this model achieves higher coding efficiency than hybrid algorithms based on LZ dictionary coding and dynamic Huffman coding.
3. Because the Chinese-character search algorithm strongly affects the time performance of any adaptive Chinese-processing system, the dissertation proposes semi-adaptive and adaptive hash-based grouped search algorithms for Chinese characters with O(1) time complexity, preserving the time performance of the compression algorithms.
4. To improve coding efficiency on Chinese texts of ordinary (small) size, a compression technique is proposed that preprocesses the text by dictionary coding against a static dictionary; the construction of the dictionary is discussed and the corresponding algorithm implemented.
5. On the basis of the above work, higher-order context-model algorithms WRPPMn for Chinese text are built, and the highest context order of practical value for Chinese text is determined to be order 1. Compared with the best existing Chinese text compression algorithms, WRPPM1 leads in compression performance on both small and large Chinese texts.
In addition, as part of this work, we built CTCC98, a Chinese corpus for testing Chinese text compression algorithms; it is large in scale, covers a wide range of texts, and mainly collects Chinese texts of ordinary size.
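The dictionary-coding preprocessing of point 4 can be illustrated with a short sketch: a greedy longest-match scan replaces dictionary words with single codes before statistical coding. The word list, code values, and the greedy matching rule below are illustrative assumptions, not the dissertation's actual dictionary or algorithm:

```python
# Hedged sketch of dictionary-coding preprocessing: greedily match the
# longest dictionary word at each position; emit (True, code) for a
# dictionary hit, (False, char) for a pass-through character.
STATIC_DICT = {"数据": 0, "数据压缩": 1, "文本": 2, "算法": 3}  # toy dictionary
MAX_WORD_LEN = max(len(w) for w in STATIC_DICT)

def preprocess(text):
    out, i = [], 0
    while i < len(text):
        # try the longest candidate first so 数据压缩 wins over 数据
        for l in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            w = text[i:i + l]
            if w in STATIC_DICT:
                out.append((True, STATIC_DICT[w]))
                i += l
                break
        else:
            out.append((False, text[i]))  # no dictionary word starts here
            i += 1
    return out

tokens = preprocess("数据压缩算法")  # two dictionary hits, no literals
```

The statistical coder then sees a shorter token stream in which frequent multi-character words occupy single symbols.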
English Abstract: Compression techniques for Chinese text reduce the space needed to store Chinese texts and the time needed to transmit them. As a form of data encryption, they can also improve network security. Chinese text compression can therefore play an important role in many fields, such as data communications, information retrieval, and electronic publishing. In addition, research on adaptive compression algorithms for Chinese text is a good way to establish adaptive models of Chinese text, which provide the foundation for adaptive Chinese-processing systems such as those for word segmentation, Chinese character recognition, and part-of-speech tagging. This dissertation reviews the research and development of data compression, especially text compression, and comprehensively studies the problems pertaining to adaptive compression of Chinese text. The main contributions are as follows.
1. By comparing and analyzing the probability distributions of LZ phrases in Chinese and English texts, the fundamental questions in developing high-performance Chinese text compression algorithms were settled. We concluded that a practical algorithm should be based on finite-context models, of which the order-0 model has a crucial influence on coding efficiency, and that adding dictionary coding with a small static dictionary as an auxiliary component is among the most effective ways to improve results on Chinese texts of normal size.
2. In accordance with the characteristics of the large alphabet of Chinese characters, we proposed the dynamic alphabet and related concepts, such as edit distance and the novel-symbol-weighted area, and studied the properties of dynamic-alphabet models. We accordingly described a method of building a dynamic-alphabet order-0 model of Chinese text. Based on this model and arithmetic coding, a real program achieved better compression ratios on Chinese texts than programs combining the Lempel-Ziv algorithm with dynamic Huffman coding.
3. Because the Chinese-character search algorithm greatly affects the time performance of an adaptive Chinese-processing system, both semi-adaptive and adaptive hash algorithms for Chinese characters were proposed and studied. With an average search length of O(1), they are integral parts of our compression algorithms.
4. To improve compression performance on Chinese texts of normal size, a technique was proposed that preprocesses the input text by dictionary coding against a static dictionary. The structure of such a dictionary was discussed and the algorithm implemented.
5. Based on all the above work, finite-context models of higher order were investigated and implemented. It was found that, for adaptive compression of Chinese text, the highest order of practical value is order 1. Compared with today's best Chinese text compression algorithms, the algorithm presented in this dissertation achieves higher compression performance on both small and large Chinese texts. In addition, CTCC98, a large corpus of Chinese texts for testing compression algorithms, was compiled from normal-size samples drawn from sources of different genres and writing styles.
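The dynamic-alphabet order-0 model of point 2 can be sketched as follows. Only symbols seen so far carry counts, and one reserved unit of count acts as a PPM-style escape holding the probability mass for novel symbols; the class interface and this particular escape weighting are illustrative assumptions, not the dissertation's novel-symbol-weighted-area construction:

```python
from collections import Counter

# Hedged sketch of an adaptive order-0 model over a dynamic alphabet:
# the alphabet grows as symbols are seen, and a reserved escape count
# covers not-yet-seen symbols (so every symbol has nonzero probability).
class DynamicAlphabetModel:
    def __init__(self):
        self.counts = Counter()  # counts for symbols seen so far

    def prob(self, sym):
        """Probability an arithmetic coder would use for sym."""
        total = sum(self.counts.values()) + 1  # +1 reserved for escape
        if sym not in self.counts:
            return 1 / total                   # escape mass for a novel symbol
        return self.counts[sym] / total

    def update(self, sym):
        self.counts[sym] += 1                  # adapt after coding sym

model = DynamicAlphabetModel()
for ch in "压缩压":
    p = model.prob(ch)  # mass the coder would assign (escape mass if novel)
    model.update(ch)
```

Coupled with arithmetic coding, such a model spends about -log2(p) bits per symbol, and its estimates sharpen as the text is processed.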
Language: Chinese
Content Type: Doctoral dissertation
URI: http://ir.iscas.ac.cn/handle/311060/7222
Appears in Collections: Institute of Software, CAS

Files in This Item:
LW002977.pdf (2027 KB) -- Restricted access; contact the repository to obtain the full text.

Recommended Citation:
王忠效. 汉语文本自适应压缩算法研究 (Research on Adaptive Compression Algorithms for Chinese Text) [D]. Institute of Software, Chinese Academy of Sciences, 2000.

Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 

Copyright © 2007-2017 Institute of Software, Chinese Academy of Sciences
Powered by CSpace