Institutional Repository of the Institute of Software, Chinese Academy of Sciences
Title:
Research on Update-Style Multi-Document Summarization Based on Lexical Chains and Text Segmentation
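Author: 李靖
Defense date: 2008-06-04
Advisor: 孙乐
Major: Computer Software and Theory
Degree-granting institution: Graduate University of the Chinese Academy of Sciences
Place of conferral: Institute of Software, Chinese Academy of Sciences
Degree: Master's
Keywords: automatic text summarization; query-focused summarization; update summarization; lexical chain
Classification number: not available
Call number: not available
Department: National Engineering Research Center of Fundamental Software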
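Abstract: This thesis starts from the background and concepts of text summarization and focuses on two recently introduced types of automatic summarization: query-focused summarization and update summarization. It then presents the main methods of automatic text summarization in detail from three perspectives (external features, simple semantic analysis, and deep semantic analysis), together with the three automatic summary evaluation methods in common international use: ROUGE, Pyramid, and BE. Taking lexical-chain-based summarization as its research focus, and building on a description of the lexical chain construction algorithm, the lexical chain summarization algorithm, and its optimizations, the thesis carries out work in the following four areas: 1) The word-sense sequence of the user's query keywords is introduced into the chain scoring algorithm: each lexical chain is scored by its semantic similarity to the query keyword sense sequence, and this score is combined with the classic Strongest Chain scoring method to obtain chains that both correctly reflect the meaning of the source text and match the user's query. 2) Exploiting the fact that lexical chains are an intermediate representation of the meaning of the original documents, similarity is computed between the chains from the historical documents and the chains from the documents to be summarized; this separates historical information from new information, and update summaries are generated accordingly. 3) The old lexical chain structure is extended to also contain the source sentences and passages of the chain members. Using the TextTiling algorithm from text segmentation research, the temporary short text formed by the source passages of a single chain's members is segmented, and the resulting segments serve as summary candidates. This combines sentence-extraction and paragraph-extraction summary generation and improves the well-formedness of the generated summaries. 4) The existing automatic text summarization system is improved with the above algorithms and combined with a cross-language retrieval system to implement a Chinese-English cross-language news summarization system.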
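The first two contributions described in the abstract (query-aware chain scoring and separating new information from historical information) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. Several things here are assumptions made for illustration: chains are represented as plain lists of word occurrences, Jaccard overlap stands in for the semantic similarity the thesis computes over word senses, the length-times-homogeneity strength score is one common formulation of chain strength, and the names `query_aware_score`, `novel_chains`, `alpha`, and `threshold` are hypothetical.

```python
def chain_strength(members):
    """Length-times-homogeneity strength of a chain.

    `members` is a list of word occurrences; repeated words make the
    chain more homogeneous and therefore stronger.
    """
    length = len(members)
    if length == 0:
        return 0.0
    distinct = len(set(members))
    return length * (1.0 - distinct / length)


def jaccard(a, b):
    """Set overlap, used here as a stand-in for semantic similarity."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def query_aware_score(members, query_terms, alpha=0.5):
    """Hypothetical linear mix of chain strength and query similarity.

    The similarity term is scaled by chain length so that both terms
    grow with the size of the chain.
    """
    similarity = jaccard(members, query_terms)
    return alpha * chain_strength(members) + (1 - alpha) * len(members) * similarity


def novel_chains(new_chains, old_chains, threshold=0.5):
    """Keep chains from the new documents that resemble no historical chain."""
    return [
        chain
        for chain in new_chains
        if all(jaccard(chain, old) < threshold for old in old_chains)
    ]


history = [["election", "vote", "ballot", "vote"]]
incoming = [
    ["election", "vote", "ballot"],
    ["earthquake", "quake", "aftershock", "quake"],
]
fresh = novel_chains(incoming, history)
ranked = sorted(fresh, key=lambda c: query_aware_score(c, ["earthquake"]), reverse=True)
```

In this sketch the election chain is dropped because it duplicates a historical chain, and only the earthquake chain remains as a source of candidate sentences for an update summary. A real system would replace `jaccard` with a sense-level similarity measure and tune `alpha` and `threshold` on evaluation data.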
English abstract: This thesis first introduces the background and main concepts of automatic text summarization and describes two summarization tasks in particular: query-focused (or topic-based) summarization and update-style summarization. We discuss and analyze existing summarization methods in three categories: those based on external features, those based on simple semantic analysis, and those based on deep semantic analysis. We also introduce three main methods for automatic summary evaluation: ROUGE, Pyramid, and BE. We then focus on the lexical chain approach to summarization and its optimized variants. We analyze the strengths and weaknesses of these approaches in depth and propose improvement strategies in the following three respects. First, we score lexical chains by calculating the semantic similarity between chains and query term sequences. By combining this scoring strategy with the classic scoring method, we can find the strongest chains, which both represent the source text and suit the query terms better. Second, we separate historical (or outdated) information from new information by calculating similarities between the lexical chains from each side's documents. We can then extract candidate sentences for an update-style summary. Third, we introduce the TextTiling algorithm into the lexical chain approach and extend the chain structure into a rich chain, which also contains the sentences its elements come from. By segmenting the sentence set of a rich chain with TextTiling, we extract the resulting segments as summary candidates. In this way we benefit from the advantages of both sentence-based and paragraph-based extraction. In the last part of the thesis we implement these improvement strategies in our summarization system and also apply the system to Chinese summarization. By combining our summarizer with cross-language IR technology, we developed our News Browser System.
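The segmentation step described in the abstract can be sketched with a simplified, sentence-level variant of the TextTiling idea: score the lexical similarity across each gap between sentences, measure how deep each similarity valley is, and cut at the deep valleys. This is a sketch under stated assumptions and not the thesis's code. The original TextTiling algorithm works on fixed-size token sequences and smooths the scores; here whole sentences are the units, there is no smoothing, and the names `window` and `min_depth` and the fixed depth cutoff are hypothetical choices for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(sentences, window=2):
    """Similarity between the `window` sentences on each side of every gap."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Depth of each gap's valley relative to the nearest peaks on both sides."""
    depths = []
    for i, score in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - score) + (scores[right] - score))
    return depths


def segment(sentences, window=2, min_depth=0.5):
    """Split sentences into segments at gaps whose depth reaches `min_depth`."""
    depths = depth_scores(gap_scores(sentences, window))
    segments, start = [], 0
    for i, depth in enumerate(depths):
        if depth >= min_depth:
            segments.append(sentences[start:i + 1])
            start = i + 1
    segments.append(sentences[start:])
    return segments


chain_sentences = [
    "quake hit city",
    "quake damaged city",
    "city quake rescue",
    "market shares fell",
    "market shares rose",
    "shares market rally",
]
tiles = segment(chain_sentences)
```

With these six sentences, the only deep valley lies between the third and fourth sentence, so `tiles` holds two three-sentence segments. In the setting the abstract describes, `chain_sentences` would be the temporary short text assembled from the source passages of one chain's members, and each resulting segment would be a summary candidate.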
Content type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/6276
Appears in Collections: National Engineering Research Center of Fundamental Software_Theses

Files in This Item:
File Name / File Size; Content Type; Version; Access; License
10001_200528015029030李靖_paper.doc (3572KB); --; --; Restricted access; -- (contact the repository to obtain the full text)

Recommended Citation:
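李靖. Research on Update-Style Multi-Document Summarization Based on Lexical Chains and Text Segmentation [D]. Institute of Software, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences. 2008-06-04.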

Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.