Institutional Repository of the Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所机构知识库)
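Subject: Computer Applications; Computer Applications::Chinese Language and Script Information Processing (including Chinese Character Information Processing)
Title:
Dictionary-Based Automatic Chinese-Tibetan Sentence Alignment (基于词典的汉藏句子自动对齐)
Author: Yu Xin (于新)
Defense date: 2011-06-01
Advisor: Wu Jian (吴健)
Major: Computer Software and Theory
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Place of conferral: Beijing
Degree: Master's
Keywords: Chinese-Tibetan sentence alignment; dictionary; Tibetan sentence segmentation; word segmentation granularity; parallel corpus
Department: National Engineering Research Center of Fundamental Software
Project: Chinese Academy of Sciences Western Action Plan project "Research, Development and Application of a Chinese-Tibetan Computer-Aided Translation System" (KGCX2-YW-512)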
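Abstract: Much research in natural language processing extracts regularities of linguistic knowledge from large-scale corpora and then uses those regularities to guide language processing. A bilingual corpus contains translation correspondences between two languages and is therefore of great research and practical value for bilingual natural language processing between them. Alignment is one of the key techniques in processing a bilingual corpus, and building sentence-level aligned data is the most basic task in corpus construction.
Drawing on mature sentence alignment methods for other languages and addressing the particular characteristics of Tibetan, this thesis proposes dictionary-based automatic Chinese-Tibetan sentence alignment. Tibetan sentence segmentation is the first problem that sentence alignment must solve. It is explored here on the basis of statistical experiments, and the thesis combines sentence-final words, sentence-initial conjunctions, and sentence length to determine Tibetan sentence boundaries. Numeric labels in Tibetan text are split off according to the formation rules of Tibetan numerals. The bilingual dictionary used for alignment is compiled and its word coverage is evaluated. Another problem in Chinese-Tibetan sentence alignment is that Chinese and Tibetan are segmented into words at different granularities. The thesis handles this by looking words up further in the Tibetan-Chinese dictionary and matching the results against the Chinese sentence, which raises the score of correct sentence pairs and so improves alignment accuracy. The method achieves a precision of 81.11%. The efficiency of the algorithm is also optimized.
The aim of this thesis is to build a practical Chinese-Tibetan bilingual sentence alignment tool and to use it to process a Chinese-Tibetan bilingual corpus. The experimental results indicate that the tool built with the proposed method has important practical and research value.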
English abstract: Many studies in natural language processing (NLP) extract linguistic knowledge from large-scale corpora and then apply that knowledge to guide NLP tasks. A bilingual corpus contains translation correspondences between two languages and is therefore highly valuable for bilingual research. Alignment is a key technique in creating a bilingual corpus, and sentence-level alignment is a basic task in building a bilingual parallel corpus.
Drawing on mature sentence alignment methods for other languages and taking the specific characteristics of Tibetan into account, this thesis proposes a dictionary-based method for automatic Chinese-Tibetan sentence alignment. The first problem to be solved in Chinese-Tibetan sentence alignment is Tibetan sentence recognition, which we explore on the basis of statistical experiments. Tibetan sentences are split according to typical sentence-final words, such as verbs, auxiliary verbs, and modal particles. Sentence-initial conjunctions and sentence length also contribute to sentence recognition. Numeric labels in Tibetan text are split off according to the formation rules of Tibetan numerals. The bilingual dictionary used for alignment is compiled and its word coverage is evaluated. Another problem encountered in Chinese-Tibetan sentence alignment is the difference in granularity between Chinese and Tibetan word segmentation. It is addressed by looking up the remaining coarse-grained Tibetan words in the Tibetan-Chinese dictionary and then matching the results against the original Chinese sentence. Experiments show that the average precision of this approach is 81.11%. The efficiency of the method is also optimized.
The purpose of this thesis is to build a practical Chinese-Tibetan bilingual sentence alignment tool and to apply it in the construction of a Chinese-Tibetan bilingual parallel corpus.
The experimental results demonstrate that the sentence alignment tool built with the proposed method has important practical and research value.
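The dictionary-based scoring idea in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The scoring formula, the function names `pair_score` and `best_match`, and the toy dictionary are all assumptions made for illustration, and the dictionary keys are placeholder tokens standing in for Tibetan words. The sketch scores a candidate sentence pair by the fraction of Tibetan tokens whose dictionary translation appears in the Chinese sentence. When no Chinese token matches, it falls back to searching the unsegmented Chinese sentence, which is one plausible reading of the granularity fix described above.

```python
from typing import Dict, List, Set

# Toy Tibetan-to-Chinese dictionary (hypothetical). Keys are placeholder
# tokens standing in for Tibetan words; no entry comes from the thesis.
TOY_DICT: Dict[str, Set[str]] = {
    "T_school": {"学校"},
    "T_primary_school_student": {"小学生"},
    "T_go": {"去", "走"},
}


def pair_score(
    tibetan_tokens: List[str],
    chinese_tokens: List[str],
    dictionary: Dict[str, Set[str]],
    use_fallback: bool = True,
) -> float:
    """Fraction of Tibetan tokens with a translation found in the Chinese sentence."""
    if not tibetan_tokens:
        return 0.0
    chinese_set = set(chinese_tokens)
    chinese_text = "".join(chinese_tokens)  # the unsegmented Chinese sentence
    hits = 0
    for token in tibetan_tokens:
        translations = dictionary.get(token, set())
        if translations & chinese_set:
            # A translation matches a Chinese token exactly.
            hits += 1
        elif use_fallback and any(t in chinese_text for t in translations):
            # Granularity fallback: the translation spans several Chinese tokens.
            hits += 1
    return hits / len(tibetan_tokens)


def best_match(
    tibetan_tokens: List[str],
    candidates: List[List[str]],
    dictionary: Dict[str, Set[str]],
) -> int:
    """Index of the candidate Chinese sentence with the highest score."""
    scores = [pair_score(tibetan_tokens, c, dictionary) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


if __name__ == "__main__":
    tibetan = ["T_primary_school_student", "T_go", "T_school"]
    related = ["小", "学生", "去", "学校"]  # "小学生" was split into two tokens
    unrelated = ["今天", "天气", "很", "好"]
    print(pair_score(tibetan, related, TOY_DICT, use_fallback=False))
    print(pair_score(tibetan, related, TOY_DICT))
    print(best_match(tibetan, [unrelated, related], TOY_DICT))
```

In this toy case the fallback raises the score of the correct pair, because the translation of the coarse-grained Tibetan token spans two Chinese tokens. A full aligner would also need to handle one-to-many and many-to-one sentence matches, which this sketch leaves out.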
Content type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/10795
Appears in Collections: National Engineering Research Center of Fundamental Software, Theses

Files in This Item:
File Name/ File Size Content Type Version Access License
基于词典的汉藏句子自动对齐最终版——于新.pdf (1658KB) ---- Restricted access; contact the repository for the full text

Recommended Citation:
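Yu Xin (于新). Dictionary-Based Automatic Chinese-Tibetan Sentence Alignment (基于词典的汉藏句子自动对齐) [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2011-06-01.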
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 
