Institutional Repository of the Institute of Software, Chinese Academy of Sciences
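Title: Research on a NUTCH-Based News Service System (基于NUTCH 的新闻服务系统的研究)
Author: Ding Haibin (丁海斌)
Defense date: 2008-06-03
Degree-granting institution: Institute of Software, Chinese Academy of Sciences
Place of conferral: Institute of Software
Degree: Doctoral
Keywords: news service; news collection; information extraction; NUTCH
Alternative title: Research of News Engine Based NUTCH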
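Abstract: Surveys show that reading news is one of the main reasons people go online, yet the sheer number of news websites makes it hard for readers to find the news that interests them. News service systems largely solve this problem, but how the commercial systems are implemented is not publicly known. This thesis analyzes and reviews existing news service technologies and systems in China and abroad. On that basis, it proposes using NUTCH as the platform and adapting it to the needs of a news service by modifying NUTCH and adding plug-ins. The main contributions are as follows. First, it gives a fairly comprehensive review of current news service technologies and systems, introduces the open-source system NUTCH, and studies the NUTCH architecture, in particular its plug-in system, systematically and in depth. Second, it analyzes methods for collecting news pages, sets out the advantages of crawler-based news collection, and then introduces crawler technology. By analyzing the link structure of news websites, it proposes and implements a news collection strategy that combines regular-expression-based URL filtering with score-based URL ordering. Based on the layout features of news pages, it analyzes several methods for extracting the main news content, and designs and implements two such methods: a configurable wrapper and a DOM-based method. Third, it introduces and analyzes personalized recommendation methods and proposes a collaborative news recommendation algorithm that models user interests on a hierarchy of news categories. Fourth, based on the analysis of NUTCH and of the modules of a news service engine, it implements a NUTCH-based news service system.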
English abstract: Surveys show that reading news online is one of the most important reasons people go online, but the large number of news websites makes it difficult to read all the news. News service systems solve this problem to a large extent; however, how these systems are implemented is not publicly known. This thesis analyzes and reviews existing domestic and international news service technologies and systems. On this basis, NUTCH is used as the platform and, through modification of NUTCH and in particular the addition of plug-ins, is made to meet the requirements of a news service. The main work is as follows. First, a survey of news reading activity is carried out, a comprehensive overview of current news service technology and current news service systems is given, and the NUTCH framework, especially its plug-in system, is studied systematically and in depth. Second, approaches to news fetching are covered and the advantages of using a crawler are identified. By analyzing the link structure of news websites, a method based on URL filtering with regular expressions is proposed, together with a news crawling strategy based on ordering URLs according to that link structure. By examining the layout features of news pages, a DOM-based algorithm for extracting the news body from news pages is developed, along with a method based on a configurable wrapper. Third, personalized recommendation is studied, and an approach to collaborative news recommendation based on a hierarchy of news categories is proposed. Finally, based on the analysis of NUTCH and of the news service engine modules, a NUTCH-based news service system is designed and implemented.
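As an illustration of the collection strategy summarized in the abstract (regular-expression URL filtering combined with score-based URL ordering), the following is a minimal Python sketch. It is not the thesis's implementation and does not use any NUTCH API; the rules, the host `news.example.com`, the scoring weights, and the function names `accept`, `score`, and `order_frontier` are all assumptions made for illustration.

```python
import heapq
import re

# Hypothetical rules as (accept?, pattern) pairs. The first matching rule decides.
RULES = [
    (False, re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    (True, re.compile(r"^https?://news\.example\.com/")),
]

# Hypothetical scoring hint: article URLs often embed a date such as /2008/06/03/.
DATE_IN_PATH = re.compile(r"/\d{4}/\d{2}/\d{2}/")


def accept(url):
    """Return True if the first matching rule accepts the URL; reject by default."""
    for allowed, pattern in RULES:
        if pattern.search(url):
            return allowed
    return False


def score(url):
    """Assign a higher score to URLs that look like article pages (assumed weights)."""
    value = 1.0
    if DATE_IN_PATH.search(url):
        value += 2.0
    if url.endswith(".html"):
        value += 1.0
    return value


def order_frontier(urls):
    """Filter the URLs, then return the survivors from highest to lowest score."""
    heap = []
    for index, url in enumerate(urls):
        if accept(url):
            # heapq is a min-heap, so push the negated score; index breaks ties.
            heapq.heappush(heap, (-score(url), index, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

Under these assumed rules, a crawler would fetch dated article pages on the target site before section or index pages, and would never enqueue images, stylesheets, or off-site links.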
Language: Chinese
Content type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/6490
Appears in Collections: Institute of Software, Chinese Academy of Sciences

Files in This Item:

There are no files associated with this item.


Recommended Citation:
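Ding Haibin (丁海斌). Research on a NUTCH-Based News Service System (基于NUTCH 的新闻服务系统的研究) [D]. Institute of Software. Institute of Software, Chinese Academy of Sciences. 2008-06-03.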
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 
