Institutional Repository of the Institute of Software, Chinese Academy of Sciences (ISCAS OpenIR)
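Title:
Design and Implementation of NNTCS, a Neural-Network-Based Text Classification System (基于神经网络的文本分类系统NNTCS的设计和实现)
Author: Liu Gang (刘钢)
Defense date: 2003
Major: Computer Application Technology
Degree-granting institution: Institute of Software, Chinese Academy of Sciences
Place conferred: Institute of Software, Chinese Academy of Sciences
Degree: Doctorate
Keywords: text classification; neural networks; feature extraction; latent semantic indexing; genetic algorithms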
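Abstract: Text classification is the foundation and core of text mining and has in recent years been a research focus in data mining and web mining; it plays an important role in traditional information retrieval, in building website index hierarchies, and in Web information retrieval. This thesis first studies the common solutions to several key problems in the field of text classification, describes the core techniques and system architecture of typical text classification systems, and outlines the range of applications of text classification. It then presents NNTCS, an automatic text classification system based on neural networks, focusing on how feature extraction, dimensionality reduction, hierarchical classification, and classifier training are implemented. In NNTCS, the first step is Chinese word segmentation of the Chinese documents: feature terms are extracted from each document and the frequency of each feature term is counted. The system uses a neural network as the classifier; the term frequencies of the feature terms form the original feature vector, whose components correspond one-to-one with the neurons of the network's input layer. During training, the network is trained on a labeled set of training documents, the error back-propagation algorithm adjusts the network weights, and the resulting fixed weights are stored as the classification knowledge. During classification, the feature vector of the document to be classified is fed in, the network is run with the fixed weights, and the output values are compared with a threshold to determine the class. The system introduces latent semantic indexing, a technique commonly used in information retrieval, to transform the original vector space into an abstract k-dimensional semantic space, reducing the dimensionality of the original vector space and improving network training speed and performance. Neural networks are widely used in general pattern recognition but comparatively rarely in text classification, mainly because the vector space is so large that network performance is limited; introducing latent semantic indexing for dimensionality reduction avoids this drawback, so the two complement each other. A genetic algorithm is incorporated into training to optimize the initial weights of the neural network. Because genetic algorithms search globally, they can avoid the neural network's problem of converging to local optima, making full use of the respective strengths of genetic algorithms and neural networks. Finally, an open test of NNTCS was conducted; the experiments show that NNTCS achieves high average recall and high average precision in text classification.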
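The pipeline described in the abstract (term-frequency vector, back-propagation training, threshold decision) can be illustrated with a minimal sketch. It is not the NNTCS implementation: the vocabulary, the four toy documents, the single hidden layer of three neurons, the learning rate, and the 0.5 threshold are all assumptions chosen for illustration, and the documents are assumed to be already segmented into tokens.

```python
import math
import random
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Map a segmented document to raw term frequencies, one slot per feature term."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyBPNet:
    """One hidden layer of sigmoid units, trained by plain error back-propagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds one neuron's weights; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.w1]
        o = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in self.w2]
        return h, o

    def train_step(self, x, target, lr=0.5):
        h, o = self.forward(x)
        # Output deltas for squared error with sigmoid activations.
        d_out = [(o[k] - target[k]) * o[k] * (1.0 - o[k]) for k in range(len(o))]
        # Hidden deltas: propagate the output deltas back through w2 before updating it.
        d_hid = [h[j] * (1.0 - h[j]) * sum(d_out[k] * self.w2[k][j] for k in range(len(o)))
                 for j in range(len(h))]
        for k, row in enumerate(self.w2):
            for j, v in enumerate(h + [1.0]):
                row[j] -= lr * d_out[k] * v
        for j, row in enumerate(self.w1):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * d_hid[j] * v

    def classify(self, x, threshold=0.5):
        """Return the indices of all classes whose output exceeds the threshold."""
        _, o = self.forward(x)
        return [k for k, value in enumerate(o) if value > threshold]

# Hypothetical feature terms and labeled documents (class 0: computing, class 1: finance).
vocabulary = ["network", "neuron", "market", "stock"]
training_set = [
    (["network", "neuron", "network"], [1.0, 0.0]),
    (["neuron", "network"], [1.0, 0.0]),
    (["market", "stock", "stock"], [0.0, 1.0]),
    (["stock", "market"], [0.0, 1.0]),
]

net = TinyBPNet(n_in=len(vocabulary), n_hidden=3, n_out=2, seed=0)
for _ in range(2000):
    for tokens, target in training_set:
        net.train_step(tf_vector(tokens, vocabulary), target)

# After training, the weights stay fixed and only forward passes are run.
print(net.classify(tf_vector(["stock", "market", "market"], vocabulary)))
```

Using one output neuron per class with an independent threshold lets a document fall into several classes or none; whether NNTCS does this is not stated in the abstract.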
English abstract: Text classification (TC) is the basis and core of text mining, and plays an important role in traditional information retrieval, construction of web site architecture, and search for web information. It has become a hot research topic in recent years. This thesis first studies the traditional solutions to some key technical problems in the field of TC, then discusses the core techniques and system architecture of typical TC systems and describes the applications of TC. Its main topic is NNTCS, a text classifier based on neural networks. Some key techniques implemented in this classifier, such as feature extraction, dimension reduction, hierarchical classification and classifier training, are discussed in detail. The first step in NNTCS is Chinese word segmentation of Chinese documents. Feature terms are selected from the documents, and the frequency of each term is recorded. NNTCS uses an artificial neural network (ANN) as the classifier. The recorded term frequencies form the original feature vector, whose components match the neurons in the input layer of the ANN one to one. In the training stage, NNTCS presents labeled documents to the ANN, and the error back-propagation (BP) algorithm is employed to adjust the weights of the network. After training, the final fixed weights are saved as the classification knowledge. In the classification stage, NNTCS inputs the feature vector of the document to be classified, runs the network with the fixed weights, and compares the output with a predefined threshold to decide the class of the unlabeled document. NNTCS adopts a traditional technique called Latent Semantic Indexing (LSI) for dimension reduction. LSI comes from the field of information retrieval. It transforms the original vector space into an abstract k-dimensional semantic space, so the huge dimensionality of the original vector space is greatly reduced and the training speed and system performance are improved.
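The abstract does not spell out how NNTCS computes the k-dimensional semantic space. A sketch of the standard LSI formulation, assuming a term-document matrix $A$ with $m$ feature terms as rows and $n$ training documents as columns (these symbols are not from the thesis record), is:

```latex
% Singular value decomposition of the m x n term-document matrix
A = U \Sigma V^{\mathsf{T}}

% Keep only the k largest singular values and their singular vectors
A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% Fold a term-frequency vector d (length m) into the k-dimensional semantic space
\hat{d} = \Sigma_k^{-1} U_k^{\mathsf{T}} d \in \mathbb{R}^{k}
```

Under this formulation the network's input layer would need only $k$ neurons instead of $m$, which is the reduction the abstract credits for the faster training.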
ANN is often used in general pattern recognition systems, but rarely in TC, because the vector space is so large that the performance of the ANN suffers. LSI's strength in dimension reduction avoids this flaw, so ANN and LSI complement each other. NNTCS employs a genetic algorithm (GA) in the training stage to optimize the initial weights of the ANN. Because GA searches globally, it can avoid the ANN's problem of local convergence, so the advantages of both GA and ANN are fully exploited. Finally, an open test was carried out on the developed system. The experimental results show that NNTCS reaches both high precision and high recall on average.
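How a genetic algorithm can supply initial weights for a network is sketched below. The abstract does not describe the chromosome encoding or the genetic operators used in NNTCS, so every choice here is an assumption made for illustration: the XOR toy task, the 2-2-1 topology, the flat weight list as chromosome, tournament selection, uniform crossover, Gaussian mutation, and elitism.

```python
import math
import random

# Toy training set (XOR); a stand-in for real document vectors.
XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
# 2-2-1 network: 2 hidden neurons x (2 inputs + bias) + 1 output x (2 hidden + bias).
N_WEIGHTS = 9

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(w, x):
    h0 = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])
    h1 = sigmoid(w[3] * x[0] + w[4] * x[1] + w[5])
    return sigmoid(w[6] * h0 + w[7] * h1 + w[8])

def fitness(w):
    """Higher is better: negated sum of squared errors on the training set."""
    return -sum((predict(w, x) - y) ** 2 for x, y in XOR)

def evolve_initial_weights(pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0][:]]  # elitism: the best individual always survives
        while len(next_gen) < pop_size:
            # Tournament selection: the fitter of two random individuals becomes a parent.
            mother = max(rng.sample(population, 2), key=fitness)
            father = max(rng.sample(population, 2), key=fitness)
            # Uniform crossover, then occasional Gaussian mutation of single genes.
            child = [m if rng.random() < 0.5 else f for m, f in zip(mother, father)]
            child = [g + rng.gauss(0.0, 0.5) if rng.random() < 0.1 else g for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best_weights, history = evolve_initial_weights()
print(len(best_weights), history[0], history[-1])
```

In a combined scheme, `best_weights` would be copied into the network as its starting point and back-propagation would then refine it; the global search only has to land near a good basin rather than reach the optimum itself.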
Language: Chinese
Content type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/7218
Appears in Collections: Institute of Software, Chinese Academy of Sciences (中科院软件所)

Files in This Item:
File Name/ File Size Content Type Version Access License
LW011264.pdf (2369KB) ---- Restricted access -- Contact to obtain full text

Recommended Citation:
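Liu Gang (刘钢). Design and Implementation of NNTCS, a Neural-Network-Based Text Classification System (基于神经网络的文本分类系统NNTCS的设计和实现) [D]. Institute of Software, Chinese Academy of Sciences. 2003-01-01.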
Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.