Institutional Repository of the Institute of Software, Chinese Academy of Sciences
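Title:
Research on Information Filtering Models and Algorithms (信息过滤模型与算法的研究)
Author: Ruan Tong (阮彤)
Defense date: 2001
Major: Computer Software and Theory
Degree-granting institution: Institute of Software, Chinese Academy of Sciences
Place of conferral: Institute of Software, Chinese Academy of Sciences
Degree: Doctorate
Keywords: information filtering ; domain framework ; Bayesian network ; feature extraction ; text classification ; online learning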
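Abstract: The data and information stored on the Internet are expanding at an unprecedented rate, giving rise to phenomena such as "information overload" and "information disorientation" and making information filtering a hot research topic. By means such as processing text and detecting user behavior, information filtering helps users effectively select suitable information and discard harmful and irrelevant information. In the field of information filtering, research on learning-based filtering algorithms and feature extraction algorithms is thriving, while products built on filtering technology, such as information dissemination systems and the information-blocking modules in firewalls, are being released continually. Against this background and starting from practical problems, this thesis studies information filtering from several angles, including domain frameworks, system models, and algorithm improvements. The thesis constructs a text-based information filtering framework, IFDF (Information Filtering Domain-Specific Framework). UML class diagrams and sequence diagrams are used to depict the static and dynamic models of information filtering, respectively, and the IFDF framework is implemented on JDK1.2 and JMS. IFDF is easy to understand, maintain, and extend, and offers high performance, providing a foundation for building applications based on information filtering. Drawing on the idea of "design patterns" from software engineering, the thesis proposes the concept of "multi-algorithm cooperation patterns", which aims to provide a reference for describing complex algorithms, constructing new algorithms, and deploying the data and computing resources that algorithms use, and ultimately to offer a way of thinking about building algorithm-intensive applications. Based on information filtering applications, the thesis proposes three patterns: the multi-algorithm parallel pattern, the multi-algorithm refiner pattern, and the self-feedback pattern. It describes these patterns in UML and gives the contexts in which they apply.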
English abstract: In recent years, research on information filtering has drawn much attention as a way to overcome information overload on the Internet. Information filtering helps users select appropriate information and eliminate useless or irrelevant information by analyzing text or detecting user behavior. Research on and applications of information filtering are currently advancing together: the study of corpus-based filtering algorithms and feature selection algorithms is thriving, while products such as information-dissemination systems and content-filtering modules in firewalls are released continually. Therefore, against the background of real-world applications and practical problems, we study information filtering from the viewpoints of domain-specific frameworks, system models, and smart algorithms. Research on information filtering involves several aspects, such as corpus construction, feature selection, and learning algorithms. In addition, the effectiveness and performance of an algorithm depend heavily on implementation techniques such as the underlying data structures. Furthermore, algorithms must adapt to different application contexts. All of these make the software difficult to program. Accordingly, we apply the theory and methods of software frameworks to the domain of information filtering for the first time. We construct an information filtering domain-specific framework (IFDF) and depict its static and dynamic models with UML class diagrams and sequence diagrams. We also implement IFDF with JDK1.2 and JMS. IFDF's merits, such as being comprehensible, maintainable, scalable, and high-performance, provide a foundation for filtering-based applications. The representation of queries and texts is a major difficulty in information filtering.
For example, lexical relationships between words, such as polysemy, synonymy, and antonymy, should be expressed, as should the logical relationships between words in a query, such as parataxis, adjunction, and restriction. Furthermore, it is always a challenge to depict non-content properties such as quality and taste. To address these problems, we propose a simplified Bayesian network named the Bayesian Model of Information Filtering (BMIF). BMIF divides the filtering model into four layers: the Virtual Feature Layer, Feature Layer, Pattern Layer, and Interests Layer. It also accommodates six nodes, such as Noise-And, Noise-Or, and Context, to express the relationships among events in information filtering. BMIF brings a series of benefits. Content-based filtering can describe what the user requires, while the advantage of collaborative filtering lies in differentiating taste; BMIF allows us to combine the two and take advantage of both. Meanwhile, manually specified user profiles and learned profiles are the two major methods for constructing a user's interests. By combining user experience and corpus-based learning, we put forward two concepts, partial learning and pattern refining, which enable users to add their own experience and observations to filtering systems. Since batch-oriented learning needs a large amount of labeled data and consumes substantial memory and disk space, online learning has become a hot topic. Traditional online learning algorithms do not address how to update features when documents arrive one by one, which leads to inconsistency between batch-oriented feature selection and online refinement of feature weights. We therefore provide an algorithm, called the Window-Based Online Feature Selection Algorithm, to settle the problem. In experiments, the algorithm has demonstrated performance comparable to that of the batch-oriented method.
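The abstract names a Noise-Or node in BMIF but does not define it. As a minimal sketch, assuming it behaves like the standard noisy-OR gate used in Bayesian networks, the probability that a node fires given its active parents could be computed as below. The function name `noisy_or`, its parameters, and the example probabilities are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def noisy_or(active: Sequence[bool],
             link_probs: Sequence[float],
             leak: float = 0.0) -> float:
    """Probability that the child node fires under a standard noisy-OR gate.

    active[i]     -- whether parent i is on (hypothetical input encoding)
    link_probs[i] -- probability that parent i alone triggers the child
    leak          -- probability the child fires with no active parent
    """
    if len(active) != len(link_probs):
        raise ValueError("active and link_probs must have the same length")
    # Start from the probability that the leak cause fails, then multiply in
    # the independent failure probability of every active parent.
    p_all_fail = 1.0 - leak
    for is_on, p in zip(active, link_probs):
        if is_on:
            p_all_fail *= 1.0 - p
    return 1.0 - p_all_fail


# Two active features, each triggering the pattern with probability 0.5.
print(noisy_or([True, True], [0.5, 0.5]))
```

Under this assumption each active parent contributes independently, so adding an active parent can only raise the firing probability.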
The thesis also studies feature selection algorithms for Chinese information filtering and offers guidance on suitable feature methods in different contexts. We compare several text segmentation methods and draw specific conclusions, for example that the performance of N-gram methods is comparable to that of dictionary-based segmentation methods. Finally, constructing a new algorithm, understanding a complex algorithm, and turning an algorithm into a runnable program generally require tremendous effort. So, drawing on the idea of "design patterns" in software engineering, we put forward the concept of "Multi-Algorithm Cooperation Patterns" (MACPS). MACPS help to describe compound algorithms, construct new algorithms, and deploy data resources and computing resources in a more structured and standard way, and ultimately provide standard methods for constructing algorithm-intensive applications. In this thesis, we identify three patterns from the domain of information filtering, namely MAPAP, MARAP, and EDFAP. We use UML to describe these patterns and give the contexts in which they apply.
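To illustrate the two families of segmentation methods compared above, here is a minimal sketch of character N-gram extraction next to a simple dictionary-based segmenter. The abstract does not say which dictionary method the thesis used; forward maximum matching is assumed here only as one common example, and the function names and the tiny dictionary are hypothetical.

```python
from typing import List, Set


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return overlapping character N-grams, ignoring whitespace."""
    if n <= 0:
        raise ValueError("n must be positive")
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def forward_max_match(text: str, dictionary: Set[str],
                      max_len: int = 4) -> List[str]:
    """Greedy dictionary segmentation: take the longest known word at each
    position, falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary, for illustration only.
words = {"信息", "过滤", "算法", "信息过滤"}
print(char_ngrams("信息过滤算法", 2))
print(forward_max_match("信息过滤算法", words))
```

The N-gram version needs no dictionary, which is one reason such methods are attractive as features for Chinese text.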
Language: Chinese
Content type: Dissertation
URI: http://ir.iscas.ac.cn/handle/311060/5896
Appears in Collections: Institute of Software, Chinese Academy of Sciences

Files in This Item:
File Name/ File Size Content Type Version Access License
LW008618.pdf (2390KB) -- -- Restricted access -- Contact to obtain the full text

Recommended Citation:
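Ruan Tong (阮彤). Research on Information Filtering Models and Algorithms (信息过滤模型与算法的研究) [D]. Institute of Software, Chinese Academy of Sciences. Institute of Software, Chinese Academy of Sciences. 2001-01-01.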
