Institutional Repository of the Institute of Software, Chinese Academy of Sciences
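Title:
Named Entity Recognition and Relation Extraction Based on Semi-Markov Conditional Random Fields (original title: 基于半马尔科夫条件随机场的命名体识别及其关系抽取研究)
Author: Huang Ruihong (黄瑞红)
Defense date: 2008-06-04
Advisor: Sun Le (孙乐)
Major: Computer Software and Theory
Degree-granting institution: Graduate University of the Chinese Academy of Sciences
Place of conferral: Institute of Software, Chinese Academy of Sciences
Degree: Master's
Keywords: named entity recognition ; Semi-Markov CRF model ; entity relation extraction ; kernel methods
Alternative title: A Study of Named Entity Recognition and Relation Extraction based on Semi-Markov Conditional Random Fields
Classification number: N/A
Call number: N/A
Department: National Engineering Research Center of Fundamental Software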
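Abstract: With the growth of the Internet and of electronic office work, a large volume of text resources has become available. Information extraction techniques help people quickly obtain useful information from large-scale text. Named entity recognition and relation extraction are two basic tasks of information extraction. After surveying the main methods currently used for named entity recognition and entity relation extraction, this thesis proposes a solution for each task. The main work is as follows: (1) It summarizes the state of research in China and abroad on named entity recognition and entity relation extraction from two perspectives, model selection and feature selection, with emphasis on statistical learning methods for named entity recognition and kernel-based methods for entity relation extraction. (2) To address the problem of determining the boundaries of named entity segments, it studies how to apply the Semi-Markov CRFs model to Chinese named entity recognition. This model requires only that the Markov property hold between segments, while the text within a segment can be flexibly assigned arbitrary rules. When the model is applied to Chinese named entity recognition, features that help identify named entity segment boundaries can be designed more effectively and more freely. Experiments show that adding segment-level features improves named entity recognition performance by 4-5 percentage points. (3) The task of entity relation extraction is to determine the semantic relation between two entities. Previous research has shown that the syntactic tree structure between the two entities is very useful for determining their relation type, whereas the relatively mature relation extraction methods based on flat features have limited ability to fully capture syntactic tree structure. This thesis therefore studies kernel-based methods for Chinese entity relation extraction. Taking the characteristics of Chinese into account, we examine how using different syntactic trees in the convolution kernel affects Chinese entity relation extraction performance, construct several composite kernels based on the convolution kernel, and improve the shortest path dependency kernel. When kernel methods were first applied to English relation extraction, the F1 score was only about 40%; using only a convolution kernel over syntactic trees, our Chinese relation extraction reaches an F1 of 35%, which suggests that kernel methods are also effective for Chinese relation extraction.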
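The segment-level decoding that point (2) of the abstract relies on can be illustrated with a small sketch. This is not the thesis's implementation: the function name `semi_markov_viterbi`, the toy gazetteer, the hand-set scores in `toy_score`, and the maximum segment length are all assumptions made for illustration. In a trained semi-Markov CRF the segment score would be a learned weighted sum of segment features rather than a hand-written rule.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)
START = "<s>"  # hypothetical marker for "no previous segment"


def semi_markov_viterbi(
    n: int,
    labels: Sequence[str],
    max_len: int,
    score: Callable[[str, str, int, int], float],
) -> List[Segment]:
    """Return the highest-scoring segmentation of n tokens.

    score(prev_label, label, start, end) scores one whole segment, so it may
    look at anything inside tokens[start:end]; only adjacent segments interact.
    """
    neg_inf = float("-inf")
    # best[i][y]: best score of a segmentation of tokens[:i] ending in label y
    best: List[Dict[str, float]] = [{y: neg_inf for y in labels} for _ in range(n + 1)]
    back: List[Dict[str, Optional[Tuple[int, Optional[str]]]]] = [
        {y: None for y in labels} for _ in range(n + 1)
    ]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for y in labels:
                if start == 0:
                    candidates = [(score(START, y, 0, end), None)]
                else:
                    candidates = [
                        (best[start][p] + score(p, y, start, end), p)
                        for p in labels
                        if best[start][p] > neg_inf
                    ]
                for value, prev in candidates:
                    if value > best[end][y]:
                        best[end][y] = value
                        back[end][y] = (start, prev)
    # Follow back-pointers from the best final label to recover the segments.
    segments: List[Segment] = []
    end, label = n, max(labels, key=lambda y: best[n][y])
    while end > 0:
        start, prev = back[end][label]
        segments.append((start, end, label))
        end, label = start, prev
    segments.reverse()
    return segments


# Toy data (hypothetical): character tokens and a tiny gazetteer.
tokens = list("孙乐在北京工作")
gazetteer = {"孙乐": "PER", "北京": "LOC"}
labels = ["O", "PER", "LOC"]


def toy_score(prev_label: str, label: str, start: int, end: int) -> float:
    """Hand-set segment score using a feature of the whole segment."""
    text = "".join(tokens[start:end])
    if label == "O":
        return 0.0 if end - start == 1 else -1.0  # prefer one-token O segments
    return 2.0 if gazetteer.get(text) == label else -2.0


result = semi_markov_viterbi(len(tokens), labels, 3, toy_score)
print(result)
```

The point of the sketch is that `toy_score` sees the whole candidate segment at once, so a feature such as "this span appears in a gazetteer" can be expressed directly, which a token-level tagger can only approximate.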
English abstract: Information Extraction (IE) aims to extract structured information from free text. Named Entity Recognition (NER) and Relation Extraction are two basic tasks in IE. Based on a survey of related work, this thesis proposes solutions to problems in each task. Specifically, it makes three contributions: 1. It gives an overview of the state of the art in named entity recognition and relation extraction, focusing on the variety of models and features used for the two tasks, and on that basis analyzes some problems facing both. 2. Named Entity Recognition needs to locate the boundaries of named entity mentions exactly; however, current mainstream NER methods lack the ability to tackle this problem. We explore the recently proposed semi-CRFs for Chinese Named Entity Recognition. In Semi-Markov CRFs, features can consider properties of whole segments, and transitions within a segment can be non-Markovian; we can therefore design features that locate the boundaries of named entity mentions more freely and more effectively. Experimental results show that the segment-level features used in semi-CRFs improve recognition performance. 3. Relation Extraction, which aims to detect related pairs of entities and identify the types of their semantic relations, relies heavily on the structural information of the parse tree in which the entity pair occurs. However, it is hard for traditional feature-based relation extraction methods to represent tree structure as flat features. We therefore systematically investigate kernel-based methods for Chinese relation extraction. Considering the special properties of Chinese, we examine how different parse tree spans perform in the convolution kernel and experiment with composite kernels to study the complementary effects between tree kernels and flat kernels. In addition, we improve the shortest path dependency kernel by replacing the strict equal-length requirement with a search for the longest common subsequence of two shortest dependency paths. Experiments show that kernel-based methods are effective for Chinese Relation Extraction.
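The relaxation of the equal-length requirement mentioned at the end of the abstract can be sketched as follows. The abstract does not give the thesis's exact formulation, so this is only one plausible reading: `strict_path_kernel` follows the usual product-of-common-features form of a shortest path dependency kernel, which is zero for paths of different length, while `lcs_path_similarity` is an assumed longest-common-subsequence style alignment in which two path positions may be matched when they share at least one feature. Both function names, the feature sets, and the English toy paths are hypothetical, and no claim is made that the relaxed score is a valid (positive semi-definite) kernel.

```python
from typing import List, Sequence, Set


def strict_path_kernel(x: Sequence[Set[str]], y: Sequence[Set[str]]) -> int:
    """Product of shared-feature counts per position; 0 if lengths differ."""
    if len(x) != len(y):
        return 0
    value = 1
    for a, b in zip(x, y):
        value *= len(a & b)
    return value


def lcs_path_similarity(x: Sequence[Set[str]], y: Sequence[Set[str]]) -> int:
    """LCS-style alignment score between two paths of possibly different length.

    Positions can be matched only if they share a feature; each match adds its
    number of shared features, and unmatched positions are simply skipped.
    """
    m, n = len(x), len(y)
    table: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            skip = max(table[i - 1][j], table[i][j - 1])
            common = len(x[i - 1] & y[j - 1])
            match = table[i - 1][j - 1] + common if common else 0
            table[i][j] = max(skip, match)
    return table[m][n]


# Toy dependency paths: each element is a set of features for a word or an arc.
path_a = [
    {"protesters", "NNS", "Noun", "PERSON"}, {"->"},
    {"seized", "VBD", "Verb"}, {"<-"},
    {"stations", "NNS", "Noun", "FACILITY"},
]
path_b = [
    {"troops", "NNS", "Noun", "PERSON"}, {"->"},
    {"raided", "VBD", "Verb"}, {"<-"},
    {"churches", "NNS", "Noun", "FACILITY"},
]
# Same as path_b but with an extra modifier on the path, so the length differs.
path_c = path_b[:4] + [{"old", "JJ", "Adj"}, {"<-"}] + path_b[4:]

print(strict_path_kernel(path_a, path_b))   # equal length: 3 * 1 * 2 * 1 * 3
print(strict_path_kernel(path_a, path_c))   # different length: always 0
print(lcs_path_similarity(path_a, path_c))  # alignment skips the extra elements
```

The strict kernel discards `path_c` entirely because of two extra elements, while the alignment-based score still credits the five positions that the two paths share.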
Content type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/6036
Appears in Collections: National Engineering Research Center of Fundamental Software_Theses

Files in This Item:
File Name / File Size | Content Type | Version | Access | License
10001_200528015029053黄瑞红_paper.pdf (425KB) | -- | -- | Restricted | --

Recommended Citation:
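Huang Ruihong (黄瑞红). Named Entity Recognition and Relation Extraction Based on Semi-Markov Conditional Random Fields (基于半马尔科夫条件随机场的命名体识别及其关系抽取研究) [D]. Institute of Software, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences. 2008-06-04.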