Institutional Repository of the Institute of Software, Chinese Academy of Sciences
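Title: 基于I-Match算法的垃圾邮件过滤研究
Author: 招立军
Defense date: 2008-06-04
Advisor: 淮晓永
Major: Computer Application Technology
Degree-granting institution: Graduate University of the Chinese Academy of Sciences
Place of conferral: Institute of Software, Chinese Academy of Sciences
Degree: Master's
Keywords: Computer software; Computer software::Operating systems and operating environments
Alternative title: Research on Spam Filtering Based on I-Match Algorithm
Classification number: N/A
Call number: N/A
Department: National Engineering Research Center of Fundamental Software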
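Abstract: Electronic mail (e-mail) is currently the most widely used Internet application. As the Internet has grown at a striking pace, e-mail has become a major channel for distributing malicious content, and spam has become the most serious blight on the Internet. Because spamming techniques are so varied, a spam filtering system usually has to combine several filtering techniques to be effective. Digest-based techniques have become one of the important filtering methods: they classify a message by measuring its similarity to known spam. Whether a spam filtering algorithm is effective depends jointly on its recall, its precision, and its running time. The I-Match algorithm decides whether two messages have similar text content by exact matching of their digest values, and it is notably efficient. In practice, however, I-Match still has many problems, including the way lexicon generation constrains its performance and its insufficient robustness under attack. Optimizing lexicon generation and improving robustness are therefore the two key issues in applying the algorithm to real systems. The main work of this thesis is as follows:
- Similarity analysis of spam, covering the causes of similarity among spam messages and the similarity characteristics they exhibit in both time and content. These characteristics are one of the necessary conditions for filtering spam with digest algorithms.
- An improved lexicon generation process for I-Match. Mutual information of features is proposed as the feature selection criterion for lexicon generation, and the effects of several feature selection methods on algorithm performance are compared.
- An analysis of the robustness of I-Match and of the robustness gains offered by several improved I-Match variants. The variants are evaluated on real e-mail corpora, and their practicality is assessed as a whole.
- A prototype of the KSpam system, which integrates multiple e-mail filtering methods as plug-ins, together with an implementation scheme for I-Match within KSpam. The system also implements a new automatic mail recycling function that effectively reduces the mail administrator's corpus collection work.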
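As a rough illustration of the exact digest matching that the abstract attributes to I-Match, the Python sketch below follows the commonly described form of the algorithm: keep only the message terms that occur in a lexicon, sort the unique survivors, and hash them. It is a sketch under stated assumptions, not the implementation used in the thesis. The function names `imatch_signature` and `is_known_spam`, the regex tokenizer, the toy lexicon, and the choice of SHA-1 are all illustrative.

```python
import hashlib
import re


def imatch_signature(text, lexicon):
    """Return a hex digest over the unique lexicon terms found in `text`.

    `lexicon` is a set of lowercase terms (hypothetical; a real lexicon
    would be built from a corpus). Returns None when no lexicon term
    occurs, since an empty term set carries no evidence of similarity.
    """
    # Assumed tokenizer: lowercase runs of ASCII letters and digits.
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    kept = sorted(tokens & lexicon)  # order-independent, duplicate-free
    if not kept:
        return None
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def is_known_spam(text, lexicon, spam_signatures):
    """Exact-match lookup of the message digest in a set of spam digests."""
    signature = imatch_signature(text, lexicon)
    return signature is not None and signature in spam_signatures


if __name__ == "__main__":
    lexicon = {"cheap", "pills", "order", "now", "discount"}
    known = {imatch_signature("Order CHEAP pills now!!!", lexicon)}
    # Same lexicon terms, different wording and order: matches.
    print(is_known_spam("order now: cheap pills, dear friend", lexicon, known))
    # One extra lexicon term changes the digest: no match.
    print(is_known_spam("order cheap pills now with discount", lexicon, known))
```

The second lookup hints at the robustness problem the abstract mentions: because matching is exact, adding or removing a single lexicon term is enough to change the digest.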
English abstract: E-mail (electronic mail) is one of the most popular Internet applications. As the Internet grows at an amazing rate, e-mail has become a significant channel for posting malicious information, and spam has become a tumor that harms the health of the Internet. Filtering out the many varieties of spam effectively requires combining multiple filtering technologies. An important one is the digest-based technique, which classifies a message by using digests to compare it with known spam. Judging the effectiveness of a spam filtering algorithm requires considering its recall, precision, and time performance. Although the I-Match algorithm is efficient because it matches digest values exactly, it still has many problems in practical applications, including the overhead introduced by the lexicon generation process and a lack of robustness against spam attacks. Optimizing the lexicon generation process and improving the algorithm's robustness are therefore the important problems in putting I-Match into practice. The major contributions of this thesis are:
- Analyze the similarity of spam, including the causes of the similarity and its time-specific and content-specific features. These similarity features are the premise for filtering spam with digest algorithms.
- Improve the lexicon generation process of the I-Match algorithm by using the mutual information of features for feature selection, and compare the performance of different feature selection methods.
- Analyze the robustness of the I-Match algorithm as well as the robustness gains obtained by refining it. Evaluate the effectiveness of several improved algorithms on real e-mail corpora and comprehensively analyze their practicality.
- Complete the KSpam prototype system, which integrates several spam filtering methods in the form of plug-ins, and implement the I-Match algorithm in this system. A new automatic e-mail recycling function, which effectively reduces the e-mail corpus collection task, is also presented.
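The abstract does not give the exact mutual information formulation used for lexicon generation. As one plausible reading, the Python sketch below scores each term by the mutual information between its presence in a message and the spam/ham label, then keeps the top `k` terms as the lexicon. The function names `mutual_information` and `build_lexicon`, the document-level presence counts, and the alphabetical tie-breaking rule are assumptions made for illustration, not the thesis's method.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) between term presence and the spam label.

    n11: spam messages containing the term   n10: ham messages containing it
    n01: spam messages without the term      n00: ham messages without it
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),  # term present, spam
        (n10, n11 + n10, n10 + n00),  # term present, ham
        (n01, n01 + n00, n11 + n01),  # term absent, spam
        (n00, n01 + n00, n10 + n00),  # term absent, ham
    )
    mi = 0.0
    for joint, term_total, class_total in cells:
        if joint > 0:  # an empty cell contributes nothing
            mi += (joint / n) * math.log2(joint * n / (term_total * class_total))
    return mi


def build_lexicon(docs, labels, k):
    """Pick the k terms with the highest mutual information.

    docs: list of token collections; labels: list of bool (True = spam).
    Ties are broken alphabetically (an arbitrary, assumed rule).
    """
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    spam_df, ham_df = Counter(), Counter()
    for tokens, is_spam in zip(docs, labels):
        (spam_df if is_spam else ham_df).update(set(tokens))
    scores = {}
    for term in set(spam_df) | set(ham_df):
        n11, n10 = spam_df[term], ham_df[term]
        scores[term] = mutual_information(n11, n10, n_spam - n11, n_ham - n10)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])
```

A term that appears in every message regardless of class scores zero and is never preferred over a term that separates spam from ham, which is the property a lexicon for digest computation would want from its selection criterion.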
Content type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/6516
Appears in Collections: National Engineering Research Center of Fundamental Software_Theses

Files in This Item:
File Name / File Size | Content Type | Version | Access | License
10001_200528015029104招立军_paper.doc (2861KB) | -- | -- | Restricted (contact to request full text) | --

Recommended Citation:
招立军. 基于I-Match算法的垃圾邮件过滤研究[D]. 中国科学院软件研究所. 中国科学院研究生院. 2008-06-04.
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.