Institutional Repository of the Institute of Software, Chinese Academy of Sciences
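Title:
Research on the Application of Data Mining to Diabetes Data (数据挖掘在糖尿病数据中的应用研究)
Author: Cheng Hua (成华)
Defense date: 2003
Major: Computer Applications
Degree-granting institution: Institute of Software, Chinese Academy of Sciences
Place of conferral: Institute of Software, Chinese Academy of Sciences
Degree: Doctoral
Keywords: data mining; knowledge discovery; type 2 diabetes; bioinformatics; decision tree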
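Abstract (translated from the Chinese): As society and the economy develop, the spectrum of diseases threatening human health is changing. Chronic non-communicable diseases pose a growing threat to the health of the population, especially the elderly, most visibly in the near-epidemic worldwide rise of type 2 diabetes. The insidious onset of type 2 diabetes and the increase of its prevalence with age indicate that it develops progressively, so controlling its incidence starts with understanding the disease. Only by understanding how the disease arises and develops, and by targeting the key pathways and key populations, can prevention and diagnosis be effective. Investigating the patterns of onset of type 2 diabetes is therefore an important and fundamental step in controlling it. Traditional research on controlling non-communicable diseases such as diabetes has relied on linear, reductionist methods, which now appear to have considerable limitations. To address these limitations, this thesis attempts to discover such patterns using ideas from bioinformatics. Data mining is one of the main techniques of bioinformatics; this thesis introduces data mining algorithms into the study of the onset patterns of type 2 diabetes, aiming to learn knowledge of these patterns from a large body of diabetes data, to mine the data and rules needed, and to build a classification and prediction system for diabetes. The raw diabetes data come from health surveys of patients and of a random population sample. The contents of the survey questionnaires were suitably transformed and stored in a database as raw data. Because these data are incomplete, noisy and inconsistent, the thesis applies data mining preprocessing techniques, namely data cleaning, data transformation and data reduction. The data mining task is to extract pathogenic patterns from the diabetes data and to build a decision system for the prevention, diagnosis and prediction of diabetes. Because the task is one of classification, and given the requirements on the mining algorithm, the thesis adopts the decision tree method. Because the diabetes data are continuous-valued, the C4.5 algorithm is chosen from among the decision tree methods. Based on an implementation of C4.5, the thesis learns onset patterns and extracts rules from the diabetes data, producing a set of diagnostic and predictive rules from the preprocessed data, and evaluates their accuracy with the holdout method. Since the accuracy of this decision tree on the diabetic group was not satisfactory, the thesis introduces a ratio factor of the training set as a parameter, examines how the accuracy on the diabetic group and the mean accuracy vary with this ratio factor, and presents an improved classifier and a solution for identifying the diabetic group.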
English abstract: As society and the economy develop, the spectrum of diseases that threaten human health is changing. Chronic non-communicable diseases are becoming a major threat to human health, especially for the elderly. The most striking example is type 2 diabetes mellitus, whose incidence has risen rapidly worldwide. Its insidious onset and the increase of its prevalence with age indicate that it is a progressive disease. Controlling it begins with understanding it thoroughly. Effective prevention and diagnosis require understanding how the disease arises and develops and targeting the key pathways and key populations. Discovering the patterns of onset of type 2 diabetes mellitus is therefore an important and fundamental step in controlling its incidence. Traditional research on controlling non-communicable diseases such as diabetes mellitus has used a linear, reductionist approach, which now appears to have considerable limitations. To address these limitations, this thesis tries to discover such patterns using bioinformatics methods. Data mining is a main technique in bioinformatics; this thesis applies it to the study of the onset of type 2 diabetes mellitus in order to learn knowledge of its onset patterns from diabetes data, to discover the required data and rules, and to construct a classification and prediction system for diabetes mellitus. The source data come from health surveys of patients and of a random population sample. The survey contents are suitably transformed and stored in a database as the source data.
Because these data are incomplete, noisy and inconsistent, we use data mining preprocessing techniques, namely data cleaning, data transformation and data reduction, to process them. The data mining task is to find patterns of onset in the large body of diabetes data and to build a decision system for preventing, diagnosing and predicting diabetes mellitus. Because the task is a classification task, and given the requirements on the mining algorithm, we choose the decision tree method. Because the diabetes data are continuous-valued, we choose the C4.5 algorithm among decision tree methods. Based on an implementation of C4.5, we learn onset patterns and rules from the diabetes data and generate a set of diagnostic and predictive rules from the preprocessed data. We then estimate their accuracy with the holdout method. Because the accuracy of this decision tree on the diabetic group is not satisfactory, we introduce a ratio factor of the training set as a parameter and test how the accuracy on the diabetic group and the mean accuracy vary with it. On this basis we provide an improved classifier and a solution for identifying the diabetic group.
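The abstract says C4.5 was chosen because the diabetes data are continuous-valued. As a rough illustration of how a C4.5-style learner can handle a continuous attribute, the sketch below tries a binary split at each midpoint between consecutive distinct values and scores it by gain ratio. It is a simplified sketch, not the thesis's implementation: the function names and the sample values are hypothetical, and real C4.5 implementations differ in details such as how the threshold is selected and how missing values are treated.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain_ratio) for the best split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct
    sorted values. Returns (None, 0.0) when no split is possible.
    """
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, total):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        p_left = len(left) / total
        # Information gain of the binary split.
        gain = base - p_left * entropy(left) - (1 - p_left) * entropy(right)
        # Split information: entropy of the partition sizes themselves.
        split_info = entropy(["L"] * len(left) + ["R"] * len(right))
        ratio = gain / split_info
        if ratio > best[1]:
            best = ((lo + hi) / 2, ratio)
    return best


# Hypothetical attribute values with made-up class labels.
glucose = [5.1, 5.4, 6.0, 7.2, 7.8, 8.5]
status = ["no", "no", "no", "yes", "yes", "yes"]
threshold, ratio = best_threshold(glucose, status)
print(threshold, ratio)
```

In this toy data the two classes separate cleanly, so the best split falls between 6.0 and 7.2; on real survey data the chosen threshold would depend on the preprocessed attribute values.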
Language: Chinese
Content type: Dissertation
URI: http://ir.iscas.ac.cn/handle/311060/6926
Appears in Collections: Institute of Software, Chinese Academy of Sciences (中科院软件所)

Files in This Item:
File Name / File Size | Content Type | Version | Access | License
LW011213.pdf (3270KB) | -- | -- | Restricted access | --

Recommended Citation:
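Cheng Hua (成华). 数据挖掘在糖尿病数据中的应用研究 [Research on the Application of Data Mining to Diabetes Data] [D]. Institute of Software, Chinese Academy of Sciences. 2003-01-01.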