Title: | 基于上下文的汉字简繁转换系统的实现 |
Author: | 刘汇丹
|
Issued Date: | 2007-06-01
|
Degree Grantor: | Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
|
Place of Degree Grantor: | Institute of Software (软件研究所)
|
Degree Level: | Doctoral
|
Keyword: | Unicode
; Simplified Chinese
; Traditional Chinese
; Simplified-Traditional conversion
|
Alternative Title: | Implementation of Context Based Chinese to Chinese Conversion System
|
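Abstract: | For historical reasons, two writing systems, Simplified and Traditional Chinese, coexist across the four regions on both sides of the Taiwan Strait. Increasingly broad and deep exchange and cooperation across the Strait in recent years has created an urgent need for Simplified-Traditional conversion systems, yet every existing system has problems of one kind or another and cannot meet the requirements of converting formal documents.
This thesis centers on the implementation of a context-based Simplified-Traditional conversion system. Its main contributions are as follows:
First, it analyzes in detail the problems that arise in Simplified-Traditional conversion at the levels of encoding, single characters, and words, and identifies where the complexity of the conversion lies. It proposes performing the conversion within the Unicode Basic Multilingual Plane (BMP), including support for all Chinese characters in the BMP outside the compatibility area. It points out that characters and words with multiple conversion targets are the source of this complexity.
Second, it points out that standard Chinese word segmentation discards contextual information, so the context cannot be fully exploited to ensure correct conversion. It proposes a strategy in which word-level conversion resolves the ambiguity of character-level conversion, and long-word conversion resolves the ambiguity of short-word conversion.
Third, it proposes cooperation between character conversion and word conversion. For some multi-target characters, certain targets are used only in one or a few words; once those words are included in the word-conversion dictionary, the character can be treated as a single-target character.
Fourth, it proposes a method for evaluating conversion correctness. Based on the resulting grades, the output is displayed with distinct styling, so that users can easily locate characters and words that may have been converted incorrectly and proofread them selectively.
Fifth, it proposes applying techniques such as named entity recognition, collocation relations, and part-of-speech tagging to the conversion system, so that the results meet users' needs and the system's accuracy is assured.
Sixth, it implements the context-based conversion system and reports test results. |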
English Abstract: | For historical reasons, two Chinese scripts coexist in the four regions across the Taiwan Strait. In recent years, communication and collaboration among these regions have become increasingly extensive and deep, creating an urgent need for Chinese-to-Chinese conversion systems. Current systems still have many problems, and none of them meets the requirements of formal document conversion.
This thesis focuses on the implementation of a context-based Chinese-to-Chinese conversion system.
First, problems at different levels, such as encoding, characters, and words, are analyzed thoroughly. Based on this analysis, the sources of complexity in conversion between Simplified and Traditional Chinese are identified. The conversion can be done within the Basic Multilingual Plane of Unicode. The ambiguity of characters or words that have two or more targets is what makes the conversion complex.
Second, formal Chinese word segmentation leads to a loss of contextual information, so conversion accuracy cannot be guaranteed. The basic rule should be to disambiguate characters through word conversion and to disambiguate short words through long-word conversion.
Third, character conversion and word conversion collaborate. For a Chinese character with multiple targets, once the word dictionary includes all words that use every target except one, the character can be treated as having a single target and converted directly to it at the character level.
Fourth, a method for evaluating conversion correctness is proposed. The converted output can be displayed in different styles, which makes it convenient for users to find and correct mistakes.
Fifth, techniques such as named entity recognition and part-of-speech tagging are used in the conversion system. As a result, terms in names can be converted character by character, which helps guarantee the correctness of character conversion.
Sixth, the conversion system is implemented and tested. The effect of displaying output in different styles is shown in screenshots. |
Language: | Chinese
|
Content Type: | Degree thesis
|
URI: | http://ir.iscas.ac.cn/handle/311060/6532
|
Appears in Collections: | Institute of Software, Chinese Academy of Sciences (中科院软件所)
|
File Name/ File Size | Content Type | Version | Access | License |
10001_200428015029009刘汇丹_paper.pdf(1049KB) | -- | -- | Restricted access | -- | Contact to request full text |
|
Recommended Citation: |
刘汇丹. 基于上下文的汉字简繁转换系统的实现[D]. 软件研究所. 中国科学院软件研究所. 2007-06-01.
|
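The conversion strategy summarized in the abstract (word conversion takes priority over character conversion, longer words take priority over shorter ones, and uncertain results are flagged for proofreading) can be illustrated with a minimal sketch. This is not the thesis's implementation: the dictionaries, the three grade names, the bracket marking, and the functions `convert` and `render` are all assumptions made for illustration, and the sketch uses simple left-to-right longest matching in place of the context analysis the thesis describes.

```python
# Toy dictionaries (hypothetical; not taken from the thesis).
WORD_TABLE = {
    "头发": "頭髮",      # word entry settles 发 -> 髮
    "发展": "發展",      # word entry settles 发 -> 發
    "下面": "下面",
    "下面条": "下麵條",  # longer word overrides the shorter 下面
}
CHAR_TABLE = {
    "头": ["頭"],
    "发": ["發", "髮"],  # multi-target character
    "面": ["面", "麵"],  # multi-target character
    "条": ["條"],
}
MAX_WORD_LEN = max(len(word) for word in WORD_TABLE)


def convert(text):
    """Return a list of (source, target, grade) segments.

    Grades (assumed names): "word" for a dictionary word,
    "single" for a single-target character, "ambiguous" for a
    multi-target character converted without word context.
    """
    segments = []
    i = 0
    while i < len(text):
        # Try the longest dictionary word starting at position i first.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in WORD_TABLE:
                segments.append((chunk, WORD_TABLE[chunk], "word"))
                i += length
                break
        else:
            # No word matched: fall back to character conversion.
            ch = text[i]
            targets = CHAR_TABLE.get(ch, [ch])  # unknown chars pass through
            grade = "single" if len(targets) == 1 else "ambiguous"
            segments.append((ch, targets[0], grade))
            i += 1
    return segments


def render(segments):
    """Join targets, marking ambiguous ones as [x?] for proofreading."""
    return "".join(
        f"[{target}?]" if grade == "ambiguous" else target
        for _, target, grade in segments
    )
```

For example, `render(convert("头发"))` uses the word entry and yields `頭髮`, while a bare `发` with no word context is converted to its first listed target and flagged as `[發?]` so that a proofreader can check it.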