Institutional Repository of the Institute of Software, Chinese Academy of Sciences
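Title:
Research on Noun Phrase Alignment Algorithms for Chinese-English Parallel Corpora (汉英平行语料库中名词短语对齐算法的研究)
Author: Xue Song (薛松)
Defense date: 2003
Major: Computer Software and Theory
Degree-granting institution: Institute of Software, Chinese Academy of Sciences
Place of conferral: Institute of Software, Chinese Academy of Sciences
Degree: Doctoral
Keywords: parallel corpus; alignment; noun phrase alignment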
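Abstract (translated from the Chinese): With the development of computers and the Internet, applications based on bilingual (or multilingual) parallel corpora are increasingly common in natural language processing. Beyond machine translation, building parallel corpora is also of great value for bilingual dictionary compilation, word sense disambiguation, and cross-language information retrieval. In processing parallel corpora, research on alignment techniques at different levels is a core topic. Aligning parallel text is not only a necessary prerequisite for acquiring linguistic knowledge from a parallel corpus but also an indispensable step in constructing the example base for example-based machine translation. Against the background of applying parallel corpora and their alignment techniques to example-based machine translation and machine-aided translation, this thesis discusses the construction of the Large-Scale Chinese-English Parallel Corpus, including corpus collection, encoding, sentence alignment, and corpus indexing. It then presents research on Chinese-English noun phrase alignment using a method that combines rules and statistics. The alignment algorithm uses an English syntactic parser to identify English noun phrases, applies syntactic pattern rules to filter the candidate set of Chinese noun phrases, and finally selects the best pairing through a similarity computation based on co-occurrence frequency. The algorithm effectively overcomes some shortcomings of methods that rely solely on rules and bilingual dictionaries, and improves accuracy.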
English abstract: With the development of computers and the Internet, the use of bilingual (or multilingual) parallel corpora has become an important issue in the field of Natural Language Processing. Parallel corpora have valuable applications in machine translation, bilingual dictionary compilation, word sense disambiguation, and Cross-Lingual Information Retrieval. In the processing of parallel corpora, research on alignment at different levels is an essential topic. To extract linguistic knowledge from a parallel corpus, it is necessary to align it first. Alignment is also an important phase before Example-Based Machine Translation (EBMT) can make use of a parallel corpus. This thesis first introduces the application of bilingual corpora and alignment in machine translation and machine-aided translation. The construction of the Large-Scale Chinese-English Parallel Corpus is discussed, including resource collection, corpus encoding, sentence alignment, and concordance. Then a noun phrase alignment algorithm that combines rules and statistics is discussed. The algorithm uses an English parser to identify English noun phrases and a set of syntactic patterns to filter out invalid candidate Chinese translations. Finally, the best candidate is selected as the Chinese translation of the English noun phrase by similarity measures based on co-occurrence. This method addresses weaknesses of traditional purely rule-based approaches that use a bilingual dictionary, and it achieves higher accuracy.
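The final selection step described in the abstract, choosing the best Chinese candidate by a co-occurrence-based similarity, could be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the abstract does not name the similarity measure, so the Dice coefficient is assumed here, and the function names (`dice_scores`, `best_alignment`) and the toy sentence pairs are hypothetical. The sketch also assumes that English noun phrases and filtered Chinese candidates have already been extracted for each aligned sentence pair.

```python
from collections import Counter


def dice_scores(aligned_pairs):
    """Compute Dice co-occurrence scores for (English NP, Chinese candidate) pairs.

    aligned_pairs: iterable of (english_nps, chinese_candidates), one entry
    per aligned sentence pair; each element is a collection of strings.
    """
    en_count = Counter()   # number of sentence pairs containing each English NP
    zh_count = Counter()   # number of sentence pairs containing each Chinese candidate
    co_count = Counter()   # number of sentence pairs containing both

    for en_nps, zh_cands in aligned_pairs:
        en_set, zh_set = set(en_nps), set(zh_cands)
        en_count.update(en_set)
        zh_count.update(zh_set)
        for e in en_set:
            for z in zh_set:
                co_count[(e, z)] += 1

    # Dice coefficient: 2 * co-occurrences / (count of e + count of z)
    return {
        (e, z): 2 * c / (en_count[e] + zh_count[z])
        for (e, z), c in co_count.items()
    }


def best_alignment(en_np, scores):
    """Return the Chinese candidate with the highest score for en_np, or None."""
    candidates = {z: s for (e, z), s in scores.items() if e == en_np}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical toy data: three aligned sentence pairs.
pairs = [
    ({"parallel corpus"}, {"平行语料库", "语料库"}),
    ({"parallel corpus", "machine translation"}, {"平行语料库", "机器翻译"}),
    ({"machine translation"}, {"机器翻译", "语料库"}),
]
scores = dice_scores(pairs)
print(best_alignment("parallel corpus", scores))
print(best_alignment("machine translation", scores))
```

Counting at the sentence-pair level keeps the sketch simple; a real system would likely also need smoothing or a frequency threshold so that rare phrases do not receive spuriously high scores.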
Language: Chinese
Content type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/5982
Appears in Collections: Institute of Software, Chinese Academy of Sciences

Files in This Item:
File Name/ File Size Content Type Version Access License
LW011253.pdf (2811KB) -- -- Restricted access -- (contact the repository to request the full text)

Recommended Citation:
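Xue Song (薛松). Research on Noun Phrase Alignment Algorithms for Chinese-English Parallel Corpora (汉英平行语料库中名词短语对齐算法的研究) [D]. Institute of Software, Chinese Academy of Sciences. Institute of Software, Chinese Academy of Sciences. 2003-01-01.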
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 
