Institutional Repository of the Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所机构知识库)
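Subject: Computer Applications ; Computer Applications::Other Disciplines of Computer Applications
Title:
Design and Implementation of a Focused Crawler System Based on Context Graphs (基于Context_Graphs的主题爬虫系统的设计与实现)
Author: Chen Xing (陈星)
Defense date: 2010-06-01
Advisor: Qin Xiao (秦晓)
Major: Computer Application Technology
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Degree: Master's
Keywords: focused crawler ; Context Graphs model ; hierarchical modeling ; link analysis ; content analysis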
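Abstract: To obtain the data users care about most in a timely way with limited hardware resources and storage space, researchers proposed focused crawlers. Conventional general-purpose crawlers neither consider how relevant page content is to a topic nor make any predictions. Focused crawlers, by contrast, use a strategy to evaluate page priority and search along the paths most likely to lead to on-topic pages, so they can reach target pages faster and more accurately. There are many focused crawling strategies. Among them, the Context Graphs method is a composite approach that combines the textual content of web pages with the hyperlink structure of the Web. It treats pages on the Web as a layered structure: pages that link toward on-topic pages are assigned to a layer according to their features. The features of each layer can then guide the crawler to discover likely on-topic pages more quickly. However, earlier Context Graphs methods did not distinguish the importance of text in different parts of a page when building the layer model, although in many cases the page title, hyperlink anchor text and similar information matter more than the body text for identifying a page's topic. In addition, the Context Graphs method cannot update its model with newly obtained on-topic pages while guiding the crawl; if these new pages were used to update the model incrementally, more accurate results might be obtained. Based on these two points, the author proposes the M-Context Graphs method, which adopts a mixed scoring method and introduces a model feedback update mechanism, and further designs and implements a prototype focused crawler system. The thesis first surveys and analyzes existing and emerging focused crawling strategies and briefly introduces the state of focused crawler system development in China and abroad. It then describes the mixed scoring method and the model feedback update strategy of M-Context Graphs in detail and gives a detailed design and implementation of a prototype focused crawler system. Finally, experiments with the system compare the M-Context Graphs algorithm with earlier algorithms; the results show that M-Context Graphs indeed achieves better results.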
English abstract: To crawl the web pages users care about most within limited hardware resources and storage space, researchers proposed the focused web crawler. General crawlers neither consider the relevance of page content to the topic nor make any prediction. In contrast, focused web crawlers use certain strategies to evaluate page priorities and search the most promising paths first. Therefore, this kind of crawler is able to acquire on-topic web pages more quickly. There are several main kinds of focused crawling strategies, and Context Graphs is one of them. The Context Graphs method combines web page content with web hyperlink structure, so it is a composite strategy. It is therefore worthwhile to analyze it in depth and improve it. The Context Graphs method treats web pages as a hierarchical structure: pages that link to on-topic pages can be assigned to some level. The crawler can achieve better results based on the characteristics of these levels. However, the Context Graphs method treats different sections of web page text equally when building the hierarchical model, although in many cases the page title and hyperlink anchor text are more important than the page body. Besides, in the crawling stage, the Context Graphs method cannot update its model dynamically. If the newly acquired on-topic pages were used to update the model, more accurate results might be obtained. Based on these two points, the author proposes M-Context Graphs, which uses a mixed rating method and a dynamic updating mechanism. Furthermore, the author designs and implements a prototype system. In this thesis, we first discuss and summarize domestic and international research on focused crawling strategies, and then briefly introduce several crawler applications. Next, we give details of the M-Context Graphs method, including the mixed rating method and the dynamic updating mechanism. Based on the improved method, a design and implementation of the focused crawler system is also given. Finally, using this prototype system, the M-Context Graphs method is compared with earlier methods. Experimental results show that the proposed method performs better.
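The two ideas named in the abstract, weighting title and anchor text above body text and feeding newly found on-topic pages back into the layer model, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the field weights, the names `mixed_score`, `assign_layer`, `update_layer` and `Frontier`, and the simple term-overlap score, which stands in for whatever scoring the thesis actually uses. Layer 0 is taken to hold the on-topic pages, with higher layers being further away in link distance.

```python
import heapq

# Hypothetical field weights: title and anchor text count more than body text.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def mixed_score(page, layer_terms, weights=FIELD_WEIGHTS):
    """Weighted fraction of a page's tokens that appear in a layer's term set."""
    hit = total = 0.0
    for field, weight in weights.items():
        tokens = page.get(field, "").lower().split()
        total += weight * len(tokens)
        hit += weight * sum(1 for t in tokens if t in layer_terms)
    return hit / total if total else 0.0


def assign_layer(page, layers):
    """Return (layer_index, score) for the best-matching layer."""
    scores = [mixed_score(page, terms) for terms in layers]
    best = max(range(len(layers)), key=lambda i: scores[i])
    return best, scores[best]


def update_layer(layers, index, page):
    """Feedback step: add a confirmed page's title and anchor terms to a layer."""
    for field in ("title", "anchor"):
        layers[index].update(page.get(field, "").lower().split())


class Frontier:
    """Crawl queue that pops lower layers first, then higher scores."""

    def __init__(self):
        self._heap = []

    def push(self, url, layer, score):
        # heapq is a min-heap, so the score is negated to pop high scores first.
        heapq.heappush(self._heap, (layer, -score, url))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

In use, each fetched page would be passed to `assign_layer`, its outgoing links pushed onto the `Frontier` with the resulting layer and score, and pages confirmed as on-topic passed to `update_layer` so that later scoring reflects them.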
Language: Chinese
Content type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/2326
Appears in Collections: Laboratory of Human-Computer Interaction and Intelligent Information Processing_Theses

Files in This Item:
File Name/ File Size Content Type Version Access License
大论文_基于Context_Graphs的主题爬虫系统的设计与实现——pdf版本.pdf (1036KB), restricted access; contact the repository to obtain the full text

Recommended Citation:
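Chen Xing. Design and Implementation of a Focused Crawler System Based on Context Graphs [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2010-06-01.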
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 
