Institutional Repository of the Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所机构知识库)
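Title:
基于上下文的汉字简繁转换系统的实现 (Implementation of a Context-Based Simplified-Traditional Chinese Character Conversion System)
Author: 刘汇丹 (Liu Huidan)
Defense date: 2007-06-01
Degree-granting institution: Institute of Software, Chinese Academy of Sciences
Place conferred: Institute of Software
Degree: Doctoral
Keywords: Unicode; Simplified Chinese; Traditional Chinese; simplified-traditional conversion
Alternative title: Implementation of Context Based Chinese to Chinese Conversion System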
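Abstract: For historical reasons, two writing systems, one simplified and one traditional, coexist across the four regions on both sides of the Taiwan Strait. In recent years, increasingly broad and deep exchange and cooperation across the Strait have created an urgent need for simplified-traditional conversion systems. Existing systems all have problems of one kind or another and cannot meet the requirements of converting formal documents. This thesis centers on the implementation of a context-based simplified-traditional conversion system and makes the following contributions. First, it analyzes in detail the problems that arise in conversion at the encoding, single-character, and word levels, and identifies where the complexity of conversion lies. It proposes performing the conversion within the Unicode Basic Multilingual Plane (BMP), including support for all Chinese characters in the non-compatibility areas of the BMP, and points out that the existence of characters and words with multiple targets is what makes conversion complex. Second, it points out that standard Chinese word segmentation causes context information to be lost, so that context cannot be fully exploited to ensure correct conversion. It proposes a strategy of using word conversion to resolve ambiguity in single-character conversion, and long-word conversion to resolve ambiguity in short-word conversion. Third, it proposes cooperation between single-character conversion and word conversion. It analyzes cases in which some target characters of a multi-target character are used only in one or a few words; once those words are included in the word conversion table, the character can be treated as a single-target character. Fourth, it proposes a method for evaluating conversion correctness. Based on the grading results, the output is displayed differently by grade, so that users can easily find which characters and words may be wrong and proofread in a targeted way. Fifth, it proposes applying techniques such as named entity recognition, collocation relations, and part-of-speech tagging to the conversion system, so that results meet user needs and conversion accuracy is maintained. Sixth, it implements the context-based conversion system and reports test results.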
English abstract: For historical reasons, two Chinese scripts coexist among the four regions on both sides of the Taiwan Strait. In recent years, communication and collaboration among these regions have become increasingly extensive and deep, which has led to an urgent need for Chinese-to-Chinese conversion systems. Current systems still have many problems, and as a result none of them can meet the requirements of formal document conversion. This thesis focuses on the implementation of a context-based Chinese-to-Chinese conversion system. First, problems at different levels, such as encoding, character, and word, are analyzed thoroughly. Based on this analysis, the sources of complexity in conversion between Simplified and Traditional Chinese are identified. The conversion can be done within the Basic Multilingual Plane of Unicode. The ambiguity of characters and words that have two or more targets is what makes the conversion complex. Second, standard Chinese word segmentation leads to a loss of context information, so conversion accuracy cannot be guaranteed. The basic rule should be to disambiguate characters with word conversion and to disambiguate short words with long-word conversion. Third, character conversion and word conversion collaborate. For several Chinese characters with multiple targets, once the word table includes all words that use every target character except one, the character can be treated as having a single target and converted to it directly at the character conversion level. Fourth, a method for evaluating conversion correctness is proposed. The conversion output can be displayed in different styles, which makes it convenient for users to correct mistakes. Fifth, technologies such as named entity recognition and part-of-speech tagging are used in the conversion system. As a result, terms in names can be converted character by character, which guarantees the correctness of character conversion. Sixth, the conversion system is implemented and tested. The effect of the differentiated display is shown in screenshots.
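The strategy described in the abstract (word conversion resolves single-character ambiguity, and longer words take priority over shorter ones) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tables `WORD_TABLE`, `CHAR_TABLE`, and `AMBIGUOUS` and the function `convert` are hypothetical toy examples, and greedy longest-match is assumed as the lookup method. Multi-target characters that no word entry covers are converted to a default guess and their positions are returned, loosely mirroring the idea of flagging uncertain output for proofreading.

```python
# Hypothetical toy tables for illustration only; a real system would load
# far larger dictionaries. Every entry maps to a string of the same length.
WORD_TABLE = {
    "头发": "頭髮",      # hair
    "发展": "發展",      # develop
    "下面": "下面",      # below
    "下面条": "下麵條",  # cook noodles: the longer word overrides "下面"
}
# Characters with a single target convert directly.
CHAR_TABLE = {"头": "頭", "条": "條", "国": "國"}
# Characters with several targets; the first candidate is only a default guess.
AMBIGUOUS = {"发": ["發", "髮"], "面": ["面", "麵"]}

MAX_WORD_LEN = max(len(word) for word in WORD_TABLE)


def convert(text):
    """Convert simplified text; return (result, indices of uncertain characters)."""
    out = []
    uncertain = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, then shorter ones (length >= 2).
        for n in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in WORD_TABLE:
                out.append(WORD_TABLE[chunk])
                i += n
                break
        else:
            # No word matched: fall back to single-character conversion.
            ch = text[i]
            if ch in AMBIGUOUS:
                uncertain.append(i)  # flag the default guess for review
                out.append(AMBIGUOUS[ch][0])
            else:
                out.append(CHAR_TABLE.get(ch, ch))
            i += 1
    return "".join(out), uncertain
```

For example, `convert("下面条")` matches the three-character entry before the two-character "下面" is considered, while `convert("剪发")` finds no word entry, falls back to the default target for "发", and reports index 1 as uncertain.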
Language: Chinese
Content type: Dissertation
URI: http://ir.iscas.ac.cn/handle/311060/6532
Appears in Collections: Institute of Software, Chinese Academy of Sciences (中科院软件所)

Files in This Item:
File Name / File Size | Content Type | Version | Access | License
10001_200428015029009刘汇丹_paper.pdf (1049KB) | -- | -- | Restricted access | -- (contact to obtain full text)

Recommended Citation:
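刘汇丹 (Liu Huidan). 基于上下文的汉字简繁转换系统的实现 (Implementation of Context Based Chinese to Chinese Conversion System) [D]. Institute of Software. Institute of Software, Chinese Academy of Sciences. 2007-06-01.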
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 
