基于上下文的汉字简繁转换系统的实现
Alternative Title: Implementation of Context Based Chinese to Chinese Conversion System
刘汇丹
2007-06-01
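Degree Grantor: Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
Degree Level: Doctoral
Place of Degree Grantor: Institute of Software
Keywords: Unicode; Simplified Chinese; Traditional Chinese; simplified-traditional conversion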
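Abstract (translated from Chinese): For historical reasons, the four regions on the two sides of the Taiwan Strait use two writing systems, one simplified and one traditional. In recent years, increasingly broad and deep exchange and cooperation across the Strait has created an urgent need for simplified-traditional conversion systems. Existing systems all have problems of one kind or another and cannot meet the needs of converting formal documents. This thesis centers on the implementation of a context-based simplified-traditional conversion system and makes the following contributions.

First, it analyzes in detail the problems that arise in conversion at the levels of encoding, single characters, and words, and identifies where the complexity of conversion lies. It proposes performing the conversion within the Unicode Basic Multilingual Plane (BMP), including support for all Chinese characters in the BMP outside the compatibility area. It points out that characters and words with multiple targets are what make the conversion complex.

Second, it points out that standard Chinese word segmentation causes contextual information to be lost, so that context cannot be fully used to guarantee correct conversion. It proposes a strategy in which word conversion resolves the ambiguity of single-character conversion and long-word conversion resolves the ambiguity of short-word conversion.

Third, it proposes cooperation between character conversion and word conversion. It analyzes the case in which some targets of a multi-target character are used only in one or a few words; once those words are included in word conversion, the character can be treated as a single-target character.

Fourth, it proposes a method for evaluating conversion correctness. Based on the grading result, the output is displayed in differentiated styles, so that users can easily find which characters and words may be wrong and proofread them in a targeted way.

Fifth, it proposes applying named entity recognition, collocation relations, part-of-speech tagging, and similar techniques to the conversion system, so that the results meet user needs and the accuracy of the conversion is maintained.

Sixth, it implements the context-based simplified-traditional conversion system and gives test results.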
Abstract: For historical reasons, two Chinese scripts, simplified and traditional, coexist in the four regions on the two sides of the Taiwan Strait. In recent years, communication and collaboration among these regions have become increasingly broad and deep, creating an urgent need for Chinese-to-Chinese conversion systems. Current systems still have many problems, and none of them meets the requirements of formal document conversion. This thesis focuses on the implementation of a context-based Chinese-to-Chinese conversion system.

First, problems at the encoding, character, and word levels are analyzed thoroughly, and on that basis the sources of complexity in converting between simplified and traditional Chinese are identified. The conversion can be done within the Basic Multilingual Plane of Unicode. The ambiguity of characters and words that have two or more targets is what makes the conversion complex.

Second, formal Chinese word segmentation leads to a loss of contextual information, so conversion accuracy cannot be guaranteed. The basic rule adopted instead is to disambiguate characters through word conversion and to disambiguate short words through long-word conversion.

Third, character conversion and word conversion are made to cooperate. For some characters with multiple targets, once word conversion covers all words that use every target character except one, the character can be treated as having a single target and converted directly at the character level.

Fourth, a method for evaluating conversion correctness is proposed. The conversion output can be displayed in different styles, which makes it convenient for the user to find and correct mistakes.

Fifth, technologies such as named entity recognition and part-of-speech tagging are used in the conversion system. As a result, terms in names can be converted character by character, which guarantees the correctness of character conversion.

Sixth, the conversion system is implemented and tested. The effect of displaying output in different styles is shown in a screenshot.
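The strategy in the abstract of letting word conversion override character conversion, and long words override short ones, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the greedy longest-match loop, the tiny `WORD_TABLE` and `CHAR_TABLE` dictionaries, the function names `convert` and `render`, and the `[x?]` marker for uncertain characters are all assumptions made for illustration.

```python
# Hypothetical toy tables (simplified -> traditional); a real system would need
# far larger, curated tables. These entries are illustrative assumptions.
WORD_TABLE = {
    "下面": "下面",      # "below": 面 stays 面
    "下面条": "下麵條",  # "cook noodles": 面 becomes 麵
    "头发": "頭髮",      # "hair": 发 becomes 髮
}

# Each simplified character maps to its candidate targets, default first.
CHAR_TABLE = {
    "下": ["下"],
    "面": ["面", "麵"],
    "条": ["條"],
    "头": ["頭"],
    "发": ["發", "髮"],
}

MAX_WORD_LEN = max(len(word) for word in WORD_TABLE)


def convert(text):
    """Return (output, status) pairs; status is 'word', 'char', or 'uncertain'."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest table word starting at position i, then shorter ones.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in WORD_TABLE:
                result.append((WORD_TABLE[chunk], "word"))
                i += length
                break
        else:
            # No word matched: fall back to single-character conversion.
            ch = text[i]
            targets = CHAR_TABLE.get(ch, [ch])
            status = "char" if len(targets) == 1 else "uncertain"
            result.append((targets[0], status))
            i += 1
    return result


def render(pairs):
    """Mark uncertain characters so a proofreader can find them quickly."""
    return "".join(
        f"[{out}?]" if status == "uncertain" else out for out, status in pairs
    )
```

With these toy tables, "下面" stays "下面", while the longer entry "下面条" yields "下麵條"; an isolated "面" falls back to its first candidate and is flagged for proofreading, loosely mirroring the differentiated display described in the abstract.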
Pages: 64
Language: Chinese
Content Type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/6532
Collection: 中科院软件所_中科院软件所 (Institute of Software, Chinese Academy of Sciences)
Recommended Citation
GB/T 7714
刘汇丹. 基于上下文的汉字简繁转换系统的实现[D]. 软件研究所. 中国科学院软件研究所,2007.
Files in This Item:
File Name/Size: 10001_20042801502900 (1049KB)
Access: Restricted access
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.