Research on Noun Phrase Alignment Algorithms in Chinese-English Parallel Corpora

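	Research on Noun Phrase Alignment Algorithms in Chinese-English Parallel Corpora
	Xue Song (薛松)
Major	Computer Software and Theory
	2003
Degree-granting institution	Institute of Software, Chinese Academy of Sciences
Degree	Doctoral
Place of degree conferral	Institute of Software, Chinese Academy of Sciences
Keywords	parallel corpus; alignment; noun phrase alignment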
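Abstract	With the development of computers and the Internet, applications based on bilingual (or multilingual) parallel corpora are increasingly common in natural language processing. Beyond machine translation, building parallel corpora is also valuable for bilingual dictionary compilation, word sense disambiguation, and cross-language information retrieval. In processing parallel corpora, alignment at different levels is a core research topic. Aligning parallel text is a prerequisite for extracting linguistic knowledge from a parallel corpus, and it is also an indispensable step in building the example base for example-based machine translation. Against the background of applying parallel corpora and alignment techniques to example-based machine translation and machine-aided translation, this thesis discusses the construction of the Large-Scale Chinese-English Parallel Corpus (《大规模汉英平行语料库》), including text collection, encoding, sentence alignment, and corpus indexing. It then presents research on Chinese-English noun phrase alignment using a method that combines rules and statistics. The alignment algorithm uses an English syntactic parser to identify English noun phrases, applies syntactic pattern rules to filter the set of Chinese noun phrase candidates, and finally selects the best pairing through a similarity computation based on co-occurrence frequency. The algorithm effectively overcomes some shortcomings of methods that rely solely on rules and a bilingual dictionary, and improves accuracy.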
Alternative abstract	With the development of computers and the Internet, the use of bilingual (or multilingual) parallel corpora has become an important issue in the field of natural language processing. Parallel corpora have valuable applications in machine translation, bilingual dictionary compilation, word sense disambiguation, and cross-lingual information retrieval. In the processing of parallel corpora, alignment at different levels is an essential research topic. To extract linguistic knowledge from a parallel corpus, it is necessary to align it first. Alignment is also an important phase before Example-Based Machine Translation (EBMT) can make use of a parallel corpus. This thesis first introduces the application of bilingual corpora and alignment in machine translation and machine-aided translation. The construction of the Large-Scale Chinese-English Parallel Corpus is discussed, including resource collection, corpus encoding, sentence alignment, and concordance. Then a noun phrase alignment algorithm that combines rules and statistics is discussed. The algorithm uses an English parser to identify English noun phrases, and a set of syntactic patterns to filter out invalid candidates for the Chinese translation correspondences. Finally, the best candidate is selected as the Chinese translation of the English noun phrase by similarity measures based on co-occurrence. This method addresses weaknesses of traditional purely rule-based approaches that rely on a bilingual dictionary, and it achieves higher accuracy.
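Pages	55
Language	Chinese
Content type	Thesis
URI	http://ir.iscas.ac.cn/handle/311060/5982
Collection	Institute of Software, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Xue Song. Research on Noun Phrase Alignment Algorithms in Chinese-English Parallel Corpora [D]. Institute of Software, Chinese Academy of Sciences. Institute of Software, Chinese Academy of Sciences, 2003.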
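
The final step described in the abstract, selecting the best Chinese candidate for each English noun phrase by a co-occurrence-based similarity, might be sketched roughly as follows. This is only an illustration under stated assumptions: the record does not say which similarity measure the thesis uses, so the Dice coefficient is swapped in here as one common choice, and the function names (`dice_scores`, `best_alignment`) and the toy data are hypothetical. English noun phrase identification by a parser and the rule-based filtering of Chinese candidates are assumed to have already happened upstream.

```python
from collections import Counter

# Toy input (hypothetical): one entry per aligned sentence pair, holding the
# English noun phrases and the Chinese candidates that survived filtering.
sample_pairs = [
    ({"machine translation"}, {"机器翻译", "翻译"}),
    ({"machine translation", "parallel corpus"}, {"机器翻译", "平行语料库"}),
    ({"parallel corpus"}, {"平行语料库", "语料库", "翻译"}),
]


def dice_scores(pairs):
    """Return Dice similarity for every (English NP, Chinese candidate) pair.

    Assumption: Dice = 2 * cooccurrence / (freq_en + freq_zh), with all
    frequencies counted once per aligned sentence pair.
    """
    en_count = Counter()
    zh_count = Counter()
    co_count = Counter()
    for en_nps, zh_cands in pairs:
        en_set, zh_set = set(en_nps), set(zh_cands)
        en_count.update(en_set)
        zh_count.update(zh_set)
        co_count.update((e, z) for e in en_set for z in zh_set)
    return {
        (e, z): 2.0 * c / (en_count[e] + zh_count[z])
        for (e, z), c in co_count.items()
    }


def best_alignment(pairs):
    """Map each English NP to its highest-scoring Chinese candidate and score."""
    best = {}
    for (e, z), score in dice_scores(pairs).items():
        if e not in best or score > best[e][1]:
            best[e] = (z, score)
    return best
```

On the toy data, `best_alignment(sample_pairs)` pairs "machine translation" with "机器翻译" and "parallel corpus" with "平行语料库", each with a Dice score of 1.0. A real system would also need a tie-breaking rule and a minimum-score threshold, neither of which is specified in this record.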

Files in this item
File name/size	Document type	Version type	Access type	License
LW011253.pdf (2811 KB)			Restricted access	--