Author: 冯元勇
Issued Date: 2008-01-10
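Degree Grantor: Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
Place of Degree Grantor: Institute of Software (软件研究所)
Degree Level: Doctoral
Keyword: named entity recognition ; entity mention detection ; entity mention coreference resolution ; machine learning ; entity detection and recognition ; text content extraction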
Alternative Title: Study on Some Key Issues in Chinese Entity Mention Detection and Coreference Resolution
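Abstract: As computer applications deepen and the Internet grows, the amount of information available to people has increased sharply, and automatically and effectively finding the information users actually need among network resources has become a pressing requirement. Text content extraction provides a means of extracting and structuring information such as the entities in a document. Its key technologies include entity mention detection and the coreference resolution of those mentions. Compared with Western languages, Chinese lacks explicit grammatical cues, so Chinese entity mention detection and coreference resolution are more difficult. Taking Chinese free text as its object of study, this dissertation investigates, under the maximum entropy framework, key problems in entity mention detection (chiefly named mentions) and the coreference resolution of those mentions. The main work and contributions are as follows:
(1) Tail-character hint features are introduced into a unified framework for detecting Chinese location-name and organization-name mentions. The proposed tail-character features generalize well over conventional hint words and sidestep word segmentation errors. They are good cues for location and organization names, the two types of named mention that are hardest to recognize. Experiments on the 863 and SIGHAN corpora show that these features are to some extent complementary to part-of-speech features; used together, they significantly improve detection performance for all types of named mention at a small training cost, in particular the precision for organization names.
(2) A fast training algorithm for conditional random field (CRF) models is proposed, based on the perceptron and semantic constraints between neighboring labels. To address the high training complexity and slow convergence of CRF models, an improved fast algorithm is proposed. It first reduces the size of the feature set by introducing small-scale character features (mainly the tail-character features of location and organization names), then obtains a more reasonable initial parameter set with the perceptron algorithm, and finally prunes the Viterbi and Baum-Welch trellis search space by introducing task-specific human knowledge into inference. Experiments on the Chinese 863 evaluation corpus and the SIGHAN06 corpus show that the algorithm reduces total training time by nearly 40% without affecting the accuracy of Chinese named entity mention detection.
(3) A global-optimum approximation algorithm for Chinese entity mention coreference resolution is proposed, based on reranking by classification confidence. To address the lack of global information available to the classifier in the conventional stepwise coreference resolution framework, a new scheme is proposed that adjusts the order of coreference chain generation (clustering) according to classification confidence. The scheme sorts all mention pairs by classification confidence and lets the more reliable classification results merge or separate mentions first, so that the clustering process approaches the global optimum. Experiments show that introducing classification confidence significantly improves overall resolution performance under several learning frameworks.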
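As a rough illustration of contribution (1), the sketch below computes tail-character hint features per character, so that no word segmentation is needed. The character lists `LOC_TAILS` and `ORG_TAILS`, the function name `tail_char_features`, and the feature-name format are assumptions made for this example; this record does not list the thesis's actual feature set.

```python
# Hypothetical tail characters that often end location / organization names.
# These lists are illustrative only, not taken from the thesis.
LOC_TAILS = {"省", "市", "县", "区", "镇", "村", "山", "河", "湖", "路"}
ORG_TAILS = {"司", "局", "部", "院", "所", "厂", "会", "队", "校", "行"}


def tail_char_features(chars, i):
    """Return sparse binary features for the character at position i.

    Works directly on characters, so it does not depend on a word
    segmenter and cannot inherit segmentation errors.
    """
    feats = {"char=" + chars[i]: 1}
    # Look at the previous, current, and next character.
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(chars):
            if chars[j] in LOC_TAILS:
                feats[f"loc_tail[{offset}]"] = 1
            if chars[j] in ORG_TAILS:
                feats[f"org_tail[{offset}]"] = 1
    return feats


if __name__ == "__main__":
    sentence = list("北京市软件研究所")
    for pos in range(len(sentence)):
        print(sentence[pos], tail_char_features(sentence, pos))
```

Feature dictionaries of this kind could be fed to any maximum entropy or CRF toolkit that accepts sparse binary features; the window of one character on each side is likewise an arbitrary choice for the example.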
English Abstract: With the rapid growth in the amount of information available on the Internet, there is an urgent need to help users search and retrieve information from diverse sources effectively. Automatic content extraction is the task of collecting information about predefined real-world objects in terms of attributes, entities, relations, and events, and thereby mining application-independent structured information from language resources. Its most fundamental subtasks include entity mention detection and coreference resolution. Because Chinese lacks grammatical cues, these subtasks are even more difficult than their counterparts in Western languages. This dissertation studies some key issues in content extraction from Chinese free text under the maximum entropy framework. The main points are as follows:
(1) A feature selection method based on the tail hint characters of location and organization names is proposed within a unified Chinese named entity mention detection framework. Location and organization names usually span long extents and have complex structure, which makes them difficult to handle. This dissertation proposes a set of tail hint characters that generalize the most common hint words of both types. Experiments on standard corpora show that, when combined, these features complement part-of-speech features and thus improve the balance among named entity mention types while preserving the precision gains for organization names.
(2) A fast algorithm for training CRFs is proposed, based on perceptron parameter initialization and heuristic rules derived from neighboring-label constraints. The usual parameter estimation procedure for CRF models suffers from high complexity and slow convergence. This dissertation proposes an algorithm that trains CRFs more quickly. First, small-scale hint character features are introduced to shrink the feature space. Then, perceptron-based parameter initialization is applied. Last, heuristic task-specific rules are imposed to reduce the search path space in the Viterbi and Baum-Welch procedures. Experiments on the China 863 Program NER and SIGHAN 2006 corpora show that this scheme saves about 40% of training time.
(3) A novel global-optimum approximation algorithm is proposed for grouping Chinese entity mentions, based on reranking pairwise coreference decisions. A global confidence measure controls the order in which mentions enter the clustering procedure. The essence of the algorithm is to ensure that more reliable decisions take effect first, so that the system is more likely to reach a globally optimal resolution. Experiments show that taking confidence into account improves the system significantly under both isolated and joint supervised learning frameworks.
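The confidence-ordered clustering idea in contribution (3) can be sketched as follows. This is only one possible reading, under stated assumptions: the pairwise classifier is assumed to return a probability that two mentions corefer, confidence is taken to be the distance of that probability from 0.5, and the function name `cluster_by_confidence` and the cannot-link bookkeeping are hypothetical choices for this example, not the thesis's actual procedure.

```python
def cluster_by_confidence(n_mentions, pair_probs):
    """Group mentions into coreference chains, most confident decisions first.

    pair_probs maps (i, j) to an assumed classifier probability that
    mentions i and j corefer. Confident "yes" decisions merge clusters;
    confident "no" decisions keep two clusters apart for all later steps.
    """
    parent = list(range(n_mentions))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # cannot[r] holds the cluster roots that root r must never merge with.
    cannot = {m: set() for m in range(n_mentions)}

    # Sort all pairs by confidence, i.e. distance from the 0.5 boundary.
    ordered = sorted(pair_probs.items(),
                     key=lambda kv: abs(kv[1] - 0.5), reverse=True)

    for (i, j), p in ordered:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already linked by a more confident decision
        if p >= 0.5:
            if rj in cannot[ri]:
                continue  # blocked by a more confident separation
            parent[rj] = ri
            # Transfer the absorbed cluster's separations to the new root.
            for r in cannot[rj]:
                cannot[r].discard(rj)
                cannot[r].add(ri)
                cannot[ri].add(r)
            cannot[rj] = set()
        else:
            cannot[ri].add(rj)
            cannot[rj].add(ri)

    chains = {}
    for m in range(n_mentions):
        chains.setdefault(find(m), []).append(m)
    return sorted(chains.values())


if __name__ == "__main__":
    probs = {(0, 1): 0.95, (1, 2): 0.60, (0, 2): 0.10, (2, 3): 0.85}
    print(cluster_by_confidence(4, probs))
```

In the usage example, the weak positive link between mentions 1 and 2 is considered last and is blocked, because the more confident decisions have already joined mention 1 with mention 0 and separated mention 0 from mention 2.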
Language: Chinese
Content Type: Dissertation
Appears in Collections: Institute of Software, Chinese Academy of Sciences (中科院软件所)

Files in This Item:
File Name / File Size | Content Type | Version | Access | License
10001_200218015003294冯元勇_paper.doc (2872KB) | -- | -- | Restricted | --
Full text is available from the repository on request.

Recommended Citation:
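冯元勇. 中文实体提及识别及其共指消解关键技术研究 (Study on Some Key Issues in Chinese Entity Mention Detection and Coreference Resolution) [D]. Institute of Software. Institute of Software, Chinese Academy of Sciences. 2008-01-10.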
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.


