Institutional Repository of the Institute of Software, Chinese Academy of Sciences (ISCAS OpenIR)
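Title:
Research and Implementation of Statistics-Based Chinese Part-of-Speech Tagging Methods (基于统计的汉语词性标注方法的研究与实现)
Author: Wei Ou (魏欧)
Defense date: 1998
Major: Computer Software
Degree-granting institution: Institute of Software, Chinese Academy of Sciences
Place of conferral: Institute of Software, Chinese Academy of Sciences
Degree: Doctoral
Keywords: part-of-speech tagging; n-gram; Markov model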
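Abstract: In recent years, as computer technology has developed and the volume of available corpora has kept growing, statistics-based natural language processing has become a research focus in computational linguistics. Part-of-speech tagging is an important practical problem in many application areas and a basic topic in natural language processing, so research on tagging methods has both practical and theoretical significance. This thesis examines statistics-based Chinese part-of-speech tagging from several angles and implements an automatic Chinese part-of-speech tagging system. It first analyzes the characteristics of Chinese words that belong to more than one part of speech, and discusses the basis for dividing Chinese word classes and the issues involved in selecting a tag set. It then presents the natural-language n-gram model used for tagging, and analyzes and describes the Viterbi tagging algorithm, which is based on dynamic programming. Next, the thesis studies statistics-based Chinese part-of-speech tagging under both supervised and unsupervised training. For supervised training, it first implements a commonly used relative-frequency training and tagging scheme, RF_Basic, and analyzes the nonlinear relationship between training-set size and tagging accuracy in terms of the structure and numerical changes of the part-of-speech probability matrix and the lexical probability matrix. Given this nonlinear relationship, and in order to make full use of the training set and raise tagging accuracy, the thesis analyzes the tagging results obtained under RF_Basic and improves the scheme in three respects: using grammatical attributes associated with words, strengthening the handling of words whose part of speech is easily mistagged, and strengthening the handling of unknown words. The result is an enhanced supervised training and tagging scheme, RF_Enhanced, which improves tagging performance and reaches accuracies of 96.5% on the closed test and 96% on the open test. For unsupervised training, no experimental reports on this topic have yet appeared in China, so the thesis offers some analysis of unsupervised Chinese part-of-speech tagging. It first introduces the Baum-Welch method for statistical training with a hidden Markov model (HMM) and implements an unsupervised training and tagging scheme, HMM_Basic. It then discusses the problems that arise, in terms of how the choice of initial model affects the system's tagging performance. Finally, the thesis describes the overall structure of the system, the organization of the word list, the part-of-speech tags, and the classification dictionary, and implementation details such as the handling of sparse matrices.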
English abstract: In recent years, with the development of computer technology and the availability of ever larger corpora, statistics-based natural language processing has become one of the most actively researched areas in computational linguistics. Part-of-speech tagging is an important practical problem with potential applications in many areas, and a basic topic in natural language processing. In this thesis, we study statistics-based methods for Chinese part-of-speech tagging from various aspects and implement a Chinese part-of-speech tagging system. We first discuss the part-of-speech ambiguity phenomena of Chinese, the criteria for determining Chinese parts of speech, and the related problems of selecting a tag set. We introduce the n-gram model used in statistical methods and its dynamic-programming solution, the Viterbi algorithm. We then study statistics-based part-of-speech tagging from the supervised and unsupervised perspectives. For the supervised approach, we first implement a popular relative-frequency training method, RF_Basic, and study the nonlinear relation between training-set size and tagging accuracy in terms of the part-of-speech matrix and the word matrix. Based on an error analysis of the basic training and tagging scheme, we improve it in three respects: using other grammatical attributes of words, strengthening the processing of words that are easily mistagged, and strengthening the processing of unknown words. This yields an enhanced supervised training and tagging scheme with higher tagging accuracy. For the enhanced scheme, closed and open tests show overall accuracies of about 96.5% and 96%, respectively. For the unsupervised approach, we first introduce the Baum-Welch method used to train a hidden Markov model and perform experiments from various initial points. We analyse the results and comment on the problems this approach faces in Chinese part-of-speech tagging, such as the tag set size and the choice of initial model. Finally, we describe the implementation techniques of our tagging system.
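The abstract names two building blocks: relative-frequency training on a tagged corpus and Viterbi decoding over an n-gram tag model. The sketch below illustrates the general technique for the bigram case only. It is not the thesis's RF_Basic implementation: the function names, the add-one smoothing, the `<s>` start symbol, and the three-sentence toy corpus are all assumptions made up for illustration.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start symbol


def train_relative_frequency(tagged_sentences):
    """Count tag bigrams and (tag, word) pairs in a tagged corpus."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counts, given, event, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log P(event | given); smoothing is an assumption."""
    row = counts.get(given, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + alpha) / (total + alpha * n_outcomes))


def viterbi(words, trans, emit, tags, vocab_size):
    """Return the highest-scoring tag sequence under the bigram model."""
    if not words:
        return []
    n_tags = len(tags)
    best = [{} for _ in words]  # best[i][t]: best log score ending in tag t
    back = [{} for _ in words]  # back[i][t]: previous tag on that best path
    for t in tags:
        best[0][t] = (log_prob(trans, START, t, n_tags)
                      + log_prob(emit, t, words[0], vocab_size))
    for i in range(1, len(words)):
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p] + log_prob(trans, p, t, n_tags))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log_prob(emit, t, words[i], vocab_size)
            back[i][t] = prev_tag
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy corpus (invented): "研究" occurs both as a verb (v) and as a noun (n).
corpus = [
    [("我", "r"), ("爱", "v"), ("研究", "n")],
    [("我", "r"), ("研究", "v"), ("语言", "n")],
    [("他", "r"), ("研究", "v"), ("语言", "n")],
]
trans, emit = train_relative_frequency(corpus)
tags = sorted(emit)
vocab_size = len({w for s in corpus for w, _ in s}) + 1  # +1 for unknown words
print(viterbi(["他", "研究", "语言"], trans, emit, tags, vocab_size))
```

On this toy corpus the ambiguous word is resolved by context: the tag bigram counts favour a verb after a pronoun and a noun after a verb. A tagger of the kind the abstract describes would additionally need a real tag set, a treatment of unknown words, and sparse-matrix storage, none of which is modelled here.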
Language: Chinese
Content type: Dissertation
URI: http://ir.iscas.ac.cn/handle/311060/5924
Appears in Collections: Institute of Software, Chinese Academy of Sciences

Files in This Item:
File Name / File Size, Content Type, Version, Access, License
N98846.pdf (2325KB), --, --, restricted access, -- (contact the repository to obtain the full text)

Recommended Citation:
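Wei Ou (魏欧). Research and Implementation of Statistics-Based Chinese Part-of-Speech Tagging Methods (基于统计的汉语词性标注方法的研究与实现) [D]. Institute of Software, Chinese Academy of Sciences. 1998-01-01.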

Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 
