Institutional Repository of the Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所机构知识库)
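Title:
Research on Proximity-Based Chinese and Chinese-English Cross-Language Text Retrieval (基于邻近关系的汉语及汉英跨语言文本检索研究)
Author: 杜林 (Du Lin)
Defense date: 1998
Major: Computer Software
Degree-granting institution: Institute of Software, Chinese Academy of Sciences
Degree-granting location: Institute of Software, Chinese Academy of Sciences
Degree: Doctoral
Keywords: information retrieval; Chinese information processing; proximity model; cross-language retrieval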
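Abstract: With the continuing development of computers and the Internet, the volume of information stored in computer systems is growing geometrically. Most of it takes the form of text and spans many subject areas, so finding information relevant to a specific topic within large, heterogeneous document collections has become one of the central problems of information retrieval research. Chinese text retrieval differs considerably from retrieval in Western languages. Chinese has no delimiters between words, so text must first be segmented; new words and terms that are missing from the segmentation dictionary are hard to recognize; and the many words with multiple parts of speech or senses reduce the accuracy and efficiency of syntactic analysis and semantic understanding. Existing Chinese full-text retrieval systems generally index text automatically with character n-grams or words, and match queries to documents by the similarity coefficient between character n-gram vectors or word vectors. Because character n-grams represent only the character stream of a document and not its content, they cannot support content-based retrieval. Word-based indexing cannot effectively solve new-word recognition during segmentation, so important concepts such as proper nouns are segmented incorrectly, which severely harms retrieval accuracy. In addition, many important concepts in documents are expressed as phrases, so pure word indexing lowers the specificity of the content representation. To address the low specificity of word indexing, and after analyzing existing retrieval models and document representations in light of the state of Chinese language processing, we combined statistical methods with shallow syntactic information to extract statistical phrases and ran indexing experiments, hoping to further improve the precision of segmentation-based retrieval. The results did not fully confirm this hypothesis. On three different document collections, statistical phrase indexing improved query precision by 6.0% to 8.7%. Statistical phrases have good statistical properties and can improve query precision to some extent, but they do not cover user queries effectively; the gains vary across queries and collections and are relatively small. Another notable feature of Chinese information retrieval applications is that they must handle large volumes of content from different domains at once. Constrained by efficiency and practicality, a segmentation dictionary cannot include the words and terms of every domain. Handling new words and technical terms from different domains is therefore another important task in improving Chinese retrieval precision. Inspired by the statistical phrase indexing work, we propose a proximity-based Chinese text retrieval model to overcome these problems. It rests on the following observation: fragments of incorrectly segmented proper nouns and domain terms lie at small proximity distances and generally have a high co-occurrence probability in the document database. Similarly, the words that make up important technical terms and phrases in the database also have high document frequency and co-occurrence probability, and the smaller the proximity distance, the tighter the modifying relation between the words. Building on the traditional word vector space model, the proximity model uses the proximity distance and co-occurrence probability between members of the word vector to describe document content, which overcomes the weaknesses of existing retrieval models in handling new words and phrasal concepts; proximity indexing also reduces the effect of polysemy on retrieval precision. Proximity-based indexing overcomes the weaknesses of character n-gram indexing and single-word indexing in handling new words and phrases, and improves indexing specificity. Experiments show that the proximity model improves retrieval precision by 20.6% on average over traditional retrieval models. However, the feature space of the proximity model is high-dimensional, and every proximity relation that occurs in a document takes part in the similarity comparison; since not all features reflect document content effectively, the accuracy of the similarity computation suffers. To make the feature representation more accurate, we use singular value decomposition to transform the high-dimensional proximity space into an orthogonal low-dimensional semantic feature space, removing the influence of minor proximity features and further improving retrieval precision. Experiments show that although the SVD-based proximity model does not raise average precision much, precision improves considerably at low recall levels. The proximity model not only improves the precision of Chinese text retrieval but also effectively resolves query term translation ambiguity in Chinese-English cross-language retrieval. Because most information on the Internet is in English, letting users who cannot describe their needs accurately in English query English documents in Chinese greatly reduces their burden. The key to Chinese-English cross-language retrieval is resolving the ambiguity that arises in query term translation. Unlike machine translation of running text, the context available for translating query terms is narrow. A common method uses the distribution of translation senses in parallel corpora from a specific application domain and chooses the most probable sense as the translation of the query term in that domain. To overcome the demanding resource requirements of parallel-corpus disambiguation, we use comparable corpora with similar content together with proximity features to disambiguate query term translations. The underlying assumption is that query term pairs with strong proximity features in Chinese also have strong proximity features in similar English corpora. Experiments show that proximity-based disambiguation with comparable corpora disambiguates well, and the precision of Chinese-English cross-language retrieval reaches 31.92%.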
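The proximity idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's actual model: the inverse-distance weighting, the `window` parameter, and the function names `proximity_pairs` and `cosine` are all assumptions made for illustration. Each document is represented by its term pairs, weighted more heavily the closer the two terms occur, and documents are compared by cosine similarity over those pair weights.

```python
import math
from collections import Counter


def proximity_pairs(tokens, window=2):
    """Map each unordered term pair to the sum of 1/distance over all
    co-occurrences at most `window` positions apart (assumed weighting)."""
    weights = Counter()
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            right = tokens[j]
            if left == right:
                continue  # a term paired with itself carries no relation
            pair = tuple(sorted((left, right)))
            weights[pair] += 1.0 / (j - i)
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse pair-weight vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# A proper noun wrongly segmented into three fragments (hypothetical data).
query = proximity_pairs(["中", "关", "村"])
doc_adjacent = proximity_pairs(["中", "关", "村", "软件", "园"])
doc_scattered = proximity_pairs(["关", "软件", "中", "园", "村"])

sim_adjacent = cosine(query, doc_adjacent)
sim_scattered = cosine(query, doc_scattered)
```

Both sample documents contain exactly the same terms, so a plain bag-of-words vector cannot tell them apart; the pair weights score the document where the fragments stay adjacent higher.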
English abstract: With the development of computer and Internet technology, the information stored in computer systems is growing exponentially. Most of it is represented as text or can easily be converted to text, and it covers many subject areas. How to access relevant information in heterogeneous document databases has become one of the key issues in information retrieval. Compared with Western languages, Chinese text retrieval has distinctive characteristics. There are no delimiters between words, so the text must be segmented into word sequences before text analysis and term indexing. It is very hard to recognize new words and subject terminology at the word segmentation stage, and the many structural ambiguities reduce the accuracy and efficiency of Chinese syntactic and semantic analysis. At present, n-gram character indexing and word indexing are the two major indexing schemes in Chinese text retrieval systems. Since n-gram character indexing units do not represent actual concepts, they cannot be used for content-based document retrieval. On the other hand, it is very difficult to identify all new words, especially the proper nouns that represent key concepts, so the query precision of word-based indexing suffers. The specificity of word-based indexing is also relatively low. To overcome this low specificity, statistical phrases, extracted using statistical and shallow syntactic information, are used for document indexing in order to improve query precision further, but the experimental results only partly verify our expectations. On three different collections, query precision improves by 6.0% to 8.7%. Statistical phrases have good statistical properties and can improve query precision to some degree, but they cannot cover all the phrases that occur in user queries.
The improvements in query precision are inconsistent across queries and document collections, and they are relatively small. Information retrieval systems must be able to process large numbers of documents that cover different domains. Under the constraints of system efficiency and usability, the word segmentation dictionary cannot contain all the words and terminology of every domain. How to deal with new words and terminology therefore becomes another important aspect of Chinese information retrieval. Building on statistical phrase indexing, we propose a new proximity-based information retrieval model to overcome the new-word and phrase problems in Chinese information retrieval. Since the fragments of wrongly segmented words and terminology have small proximity distances and high co-occurrence probabilities in the document collection, the proximity and mutual information of term pairs can be used to represent the text content, which is more accurate than the traditional word-based representation. The experiments show that proximity-based indexing improves query precision by 20.6% compared with word-based indexing. However, the vector dimension of word proximity-based indexing is very high, and all proximity relations are treated equally in the document similarity comparison, which impairs the accuracy of proximity-based querying. To improve query precision further, we employ Singular Value Decomposition to convert the high-dimensional proximity vector space into a low-dimensional orthogonal concept space in which documents are matched more accurately. The evaluation results show that the query precision improvement at the 0.1 recall level is about 4.5%. The proximity-based model can be used not only for Chinese information retrieval but also to resolve term translation ambiguities in cross-language retrieval.
Since most of the information provided on the Internet is in English, forming queries is inconvenient for Chinese users who cannot express their ideas accurately in English, so Chinese-English cross-language retrieval is very helpful: users can enter Chinese queries to access the relevant English documents. The main issue in cross-language retrieval is how to resolve the translation ambiguities of the query terms. Unlike machine translation, cross-language retrieval has a relatively small translation context. The commonly used method is to collect the translation probabilities of a given term from a parallel corpus and choose the most probable translation for the query. Because parallel corpora are a scarce resource, we instead use Chinese-English comparable corpora and the proximity model to resolve the ambiguities in query term translation. This is based on the following assumption: term pairs with a high proximity coefficient in Chinese corpora also have a high proximity coefficient in the comparable English corpora. The query experiments show that the proximity-based algorithms have excellent disambiguation abilities. The query precision of Chinese-English cross-language retrieval reaches 31.92%.
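The disambiguation assumption stated at the end of the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the thesis's algorithm: the bilingual `lexicon`, the tiny English corpus, the inverse-distance score, and the function names `pair_scores` and `disambiguate` are all hypothetical. The sketch picks, for each Chinese query term, the English candidate whose pairings with the other chosen candidates have the strongest proximity in an English corpus.

```python
from collections import Counter
from itertools import product


def pair_scores(docs, window=5):
    """Sum 1/distance for every unordered term pair that co-occurs within
    `window` positions in the target-language corpus (assumed score)."""
    scores = Counter()
    for tokens in docs:
        for i, left in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if tokens[j] != left:
                    scores[frozenset((left, tokens[j]))] += 1.0 / (j - i)
    return scores


def disambiguate(query_terms, lexicon, scores):
    """Return the tuple of translations, one per query term, whose pairwise
    proximity score in the target corpus is highest."""
    candidates = [lexicon[term] for term in query_terms]

    def total(choice):
        return sum(
            scores[frozenset((a, b))]
            for idx, a in enumerate(choice)
            for b in choice[idx + 1:]
        )

    return max(product(*candidates), key=total)


# Hypothetical bilingual lexicon and English comparable corpus.
lexicon = {
    "程序": ["program", "procedure"],
    "设计": ["design", "plot"],
    "语言": ["language", "tongue"],
}
english_docs = [
    ["program", "design", "language", "compiler"],
    ["legal", "procedure", "court"],
    ["tongue", "muscle"],
]
best = disambiguate(["程序", "设计", "语言"], lexicon, pair_scores(english_docs))
```

The exhaustive search over `product(*candidates)` grows exponentially with query length; that is acceptable for short queries, and a greedy or beam search would be the natural substitute for longer ones.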
Language: Chinese
Content type: Dissertation
URI: http://ir.iscas.ac.cn/handle/311060/7170
Appears in Collections: 中科院软件所 (Institute of Software, Chinese Academy of Sciences)

Files in This Item:
File Name: LW002840.pdf | File Size: 1876KB | Access: restricted (contact the repository to obtain the full text)

Recommended Citation:
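杜林 (Du Lin). Research on Proximity-Based Chinese and Chinese-English Cross-Language Text Retrieval (基于邻近关系的汉语及汉英跨语言文本检索研究) [D]. Institute of Software, Chinese Academy of Sciences. 1998-01-01.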