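Design and Implementation of NNTCS, a Neural-Network-Based Text Classification System
Liu Gang (刘钢)
Major: Computer Application Technology
2003
Degree Grantor: Institute of Software, Chinese Academy of Sciences
Degree Level: Doctoral
Place of Degree Grantor: Institute of Software, Chinese Academy of Sciences
Keywords: text classification; neural networks; feature extraction; latent semantic indexing; genetic algorithm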
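Abstract (translated from the Chinese): Text classification is the foundation and core of text mining. It has been an active research topic in data mining and web mining in recent years, and it plays an important role in traditional information retrieval, in building website index structures, and in web information retrieval. This thesis first surveys the methods commonly used to address several key problems in text classification, describes the core techniques and architecture of typical text classification systems, and outlines the application areas of text classification. It then presents NNTCS, an automatic text classification system based on neural networks, focusing on how feature extraction, dimensionality reduction, hierarchical classification, and classifier training are implemented. In NNTCS, the first step is Chinese word segmentation of the documents; feature terms are extracted from each document and their frequencies are counted. The system uses a neural network as the classifier: the term frequencies form the original feature vector, whose components correspond one-to-one with the neurons of the input layer. During training, a labeled document set is used to train the network, and the error back-propagation algorithm adjusts the weights; the resulting fixed weights are stored as classification knowledge. During classification, the feature vector of the document to be classified is fed to the fixed-weight network, and the outputs are compared with a threshold to determine the class. The system introduces latent semantic indexing, a technique widely used in information retrieval, to map the original vector space into an abstract k-dimensional semantic space, which reduces the dimensionality and improves training speed and performance. Neural networks are common in general pattern recognition but are seldom used for text classification, mainly because the vector space is so large that network performance is limited; dimensionality reduction with latent semantic indexing avoids this drawback, so the two techniques complement each other. Training is combined with a genetic algorithm that optimizes the initial weights of the network. Because a genetic algorithm searches globally, it can avoid the neural network's tendency to converge to local optima, so the strengths of both are exploited. Finally, an open test of NNTCS was carried out; the experiments show that NNTCS achieves high average recall and average precision in text classification.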
Abstract: Text classification (TC) is the basis and core of text mining, and plays an important role in traditional information retrieval, the construction of web site index structures, and web information search. It has become an active research topic in recent years. This thesis first studies the usual solutions to several key technical problems in the field of TC, discusses the core techniques and system architecture of typical TC systems, and describes the applications of TC. It then presents, as its main topic, a text classification system based on neural networks (NNTCS). Key techniques implemented in this system, such as feature extraction, dimension reduction, hierarchical classification, and classifier training, are discussed in detail. The first step in NNTCS is Chinese word segmentation of the documents. Feature terms are selected from the documents, and the frequency of each term is recorded. NNTCS uses an artificial neural network (ANN) as the classifier. The recorded term frequencies form the original feature vector, whose components correspond one-to-one with the neurons in the input layer of the ANN. In the training stage, NNTCS applies labeled documents to the ANN, and the error back-propagation (BP) algorithm is employed to adjust the weights of the network. After training, the final fixed weights are saved as classification knowledge. In the classification stage, NNTCS inputs the feature vector of the document to be classified, runs the network with the fixed weights, and compares the output with a predefined threshold to decide the class of the unlabeled document. For dimension reduction, NNTCS adopts Latent Semantic Indexing (LSI), a well-established technique from the field of information retrieval. LSI transforms the original vector space into an abstract k-dimensional semantic space, which greatly reduces the dimensionality of the original vector space and improves training speed and system performance.
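The classification stage described above (a term-frequency vector goes in, thresholded network outputs come out) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the NNTCS implementation: the vocabulary, the class names, the weights, and the 0.5 threshold are all made up, the document is assumed to be already segmented into tokens, and a single-layer network stands in for whatever architecture the thesis actually uses.

```python
import math
from collections import Counter

# Hypothetical vocabulary of feature terms; one input neuron per term.
VOCAB = ["network", "training", "market", "stock"]

def tf_vector(tokens, vocab=VOCAB):
    """Count how often each feature term occurs in an already-segmented document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, weights, biases, threshold=0.5):
    """Run a fixed-weight single-layer network on x.

    Returns the indices of the classes whose output exceeds the threshold,
    together with the raw outputs.
    """
    outputs = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
               for row, b in zip(weights, biases)]
    labels = [k for k, out in enumerate(outputs) if out > threshold]
    return labels, outputs

# Illustrative fixed weights (invented, not trained):
# class 0 = "computing", class 1 = "finance".
WEIGHTS = [[1.0, 1.0, -1.0, -1.0],
           [-1.0, -1.0, 1.0, 1.0]]
BIASES = [0.0, 0.0]

doc = ["network", "training", "network", "stock"]
x = tf_vector(doc)                      # term frequencies in VOCAB order
labels, outputs = classify(x, WEIGHTS, BIASES)
print(x, labels)                        # [2, 1, 0, 1] [0]
```

In a trained system the weights would come from back-propagation rather than being written by hand, and the input would typically be the LSI-reduced vector rather than the raw term frequencies.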
ANNs are often used in general pattern recognition systems, but rarely in TC, because the vector space is so large that the performance of the ANN is limited. LSI's dimension reduction avoids this drawback, so the two techniques complement each other. NNTCS employs a genetic algorithm (GA) in the training stage to optimize the initial weights of the ANN. Because a GA searches globally, it can avoid the ANN's problem of local convergence, so the advantages of both GA and ANN are brought into play. Finally, an open test is carried out on the developed system. The experimental results show that NNTCS reaches both high precision and high recall on average.
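The idea of using a GA to choose initial weights before gradient-based training can be sketched as below. This is a minimal illustration under assumptions, not the procedure used in the thesis: the toy data set, the single sigmoid unit, the population size, the one-point crossover, the Gaussian mutation, and the keep-the-fitter-half elitism are all choices made here for brevity.

```python
import math
import random

# Toy training set (invented): 2-dimensional inputs with binary labels.
DATA = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
N_WEIGHTS = 3  # two input weights plus a bias for a single sigmoid unit

def predict(w, x):
    z = w[0] * x[0] + w[1] * x[1] + w[2]
    return 1.0 / (1.0 + math.exp(-z))

def error(w):
    """Sum of squared errors over the training set; lower means fitter."""
    return sum((predict(w, x) - y) ** 2 for x, y in DATA)

def evolve(pop_size=20, generations=30, mutation_scale=0.5, seed=0):
    """Search for good initial weights; returns (best weights, best error per generation)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=error)
        history.append(error(pop[0]))
        elite = pop[: pop_size // 2]            # fitter half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)         # pick two distinct parents
            cut = rng.randrange(1, N_WEIGHTS)   # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(N_WEIGHTS)        # mutate one gene with Gaussian noise
            child[i] += rng.gauss(0.0, mutation_scale)
            children.append(child)
        pop = elite + children
    best = min(pop, key=error)
    history.append(error(best))
    return best, history

best, history = evolve()
print("error of best initial weights:", history[-1])
```

Because the fitter half of each generation is carried over unchanged, the best error never increases from one generation to the next. The weights in `best` would then serve as the starting point for back-propagation instead of a purely random initialization.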
Pages: 73
Language: Chinese
Content Type: Dissertation
URI: http://ir.iscas.ac.cn/handle/311060/7218
Collection: Institute of Software, Chinese Academy of Sciences
Recommended Citation
GB/T 7714
Liu Gang. Design and Implementation of NNTCS, a Neural-Network-Based Text Classification System [D]. Institute of Software, Chinese Academy of Sciences, 2003.
Files in This Item:
LW011264.pdf (2369 KB), restricted access
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.