Institutional Repository of the Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所机构知识库)
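Subject: Computer Science and Technology::Computer Software::Software Theory
Title:
典型流式机器学习算法并行化设计与实现 (Parallel Design and Implementation of Typical Streaming Machine Learning Algorithms)
Author: 沈雯婷
Issued Date: 2018-05-22
Supervisor: 王伟
Major: Computer Software and Theory
Degree Grantor: 中国科学院研究生院 (Graduate School of the Chinese Academy of Sciences)
Place of Degree Grantor: Beijing
Degree Level: Master's
Keyword: streaming machine learning ; incremental learning ; online learning ; stream data mining ; parallelization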
Abstract:

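Machine learning is a class of algorithms that automatically analyze known data to derive patterns and then use those patterns to make predictions on unknown data; it is widely applied across many fields. In many current machine learning applications, the data arrives as a "high-speed", "dynamic", and "unbounded" stream: the volume of training data is large and the data distribution varies with time. In a streaming environment, the model of a machine learning algorithm must therefore be updated continuously while maintaining both real-time performance and accuracy, and traditional batch processing cannot meet these requirements.
Incremental learning and incremental batch processing with periodic computation can accommodate the needs of machine learning in streaming environments. Streaming machine learning algorithms refine these techniques so that the model can be updated in real time and the importance of data decays over time, which improves accuracy. Some distributed machine learning frameworks provide parallel implementations of a subset of streaming machine learning algorithms, but existing work has the following problems: (1) typical streaming machine learning algorithms lack parallelization schemes; (2) computation modes and data flow models for streaming machine learning are missing; (3) the parallel streaming algorithms already implemented deliver poor real-time performance and accuracy.
To address these shortcomings, this thesis (1) summarizes the characteristics of streaming machine learning algorithms; (2) classifies streaming machine learning algorithms into three computation modes, namely micro-batch incremental update, online incremental update, and online synopsis update, which correspond to adaptations of batch processing, online learning, and stream data mining techniques to streaming environments; (3) builds mathematical models from the temporal logic of the computation steps in each mode, and expresses the dependencies between the inputs and outputs of the computation functions in those models as data flow models, comprising a parameter increment computation flow, a parameter update flow, and a model computation flow; (4) proposes partitioning schemes for streaming data and changing parameters and parallel implementation methods for each computation flow in the data flow models, and distills steps for the streaming and parallel design of algorithms together with decision rules for choosing a parallelization method; (5) applies the three computation modes, the data flow models, and the parallelization methods to design and implement typical streaming machine learning algorithms in streaming, parallel form on the Flink distributed stream processing framework.
Experimental results show that, using the three computation modes, the corresponding data flow models, and the parallel design methods proposed in this thesis, batch algorithms, online learning algorithms, and stream data mining algorithms can all be implemented straightforwardly as parallel streaming algorithms in a distributed environment; computation latency is on the order of hundreds of milliseconds, meeting real-time expectations; throughput increases as compute nodes are added, so the algorithms scale; and the accuracy loss of the parallel implementations relative to the serial implementations stays within one order of magnitude.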
 

English Abstract:

Machine learning automatically analyzes known data instances to obtain patterns and uses those patterns to predict unknown results. In current machine learning applications, data instances often arrive as a high-speed, dynamic, and unbounded stream. Because the data volume is large and the data distribution is time-dependent, the model of a streaming machine learning algorithm must be updated constantly while maintaining both real-time performance and accuracy. Traditional batch processing algorithms cannot meet these requirements.
Incremental learning and periodic incremental batch computation meet the memory-usage and model-updating needs of streaming environments. Streaming machine learning algorithms build on these techniques so that the model is updated in real time and the importance of data is reduced over time. Although some streaming machine learning algorithms have been implemented in parallel on several distributed frameworks, the following problems remain: (1) classical streaming machine learning algorithms lack parallel approaches, (2) calculation modes and data flow models are missing, and (3) the parallel streaming algorithms that have been implemented perform poorly in real-time behavior and accuracy.
To address the deficiencies of existing work, the thesis (1) summarizes the characteristics of streaming machine learning algorithms, (2) generalizes streaming machine learning algorithms into three modes: mini-batch incremental update, online incremental update, and online sketch update, which correspond to improvements of batch processing, online learning, and One-Pass techniques, (3) defines mathematical models for the calculation modes and expresses the temporal logic and the relationships between the inputs and outputs of the calculation functions in those models as data flow models, (4) proposes partition methods for streaming data and the changing model, proposes a parallel implementation method for each computing flow, and establishes steps for streaming and parallel implementation, and (5) guides the implementation of streaming machine learning algorithms based on the proposed calculation modes, data flow models, and parallel design rules.
The experimental results show that, with the calculation modes, data flow models, and parallel rules proposed in this thesis, batch algorithms and streaming algorithms can be easily implemented as parallel streaming algorithms in a distributed environment. Real-time expectations are met, with computational delay on the order of hundreds of milliseconds. The algorithms scale, with throughput increasing as computing nodes are added. The accuracy loss of the parallel implementation is kept within one order of magnitude compared to the serial implementation.
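
As an illustration only, the three calculation modes named in the abstract can be sketched on a toy task: tracking the mean and variance of a numeric stream. This is not the thesis's implementation, which targets Flink and is not reproduced in this record. The function and class names, the decay factor, and the choice of statistics are all assumptions made for this sketch.

```python
from typing import Iterable, Iterator, List


def mini_batch_mean(stream: Iterable[float], batch_size: int) -> Iterator[float]:
    """Mini-batch incremental update (assumed form): the model, here a
    running mean, is refreshed once per full micro-batch."""
    total, count = 0.0, 0
    batch: List[float] = []
    for x in stream:
        batch.append(x)
        if len(batch) == batch_size:
            total += sum(batch)   # increment computed from the whole batch
            count += len(batch)
            batch.clear()
            yield total / count   # updated model after this micro-batch
    # A trailing partial batch is ignored in this sketch.


def online_decayed_mean(stream: Iterable[float], decay: float = 0.9) -> Iterator[float]:
    """Online incremental update (assumed form): the model is updated per
    instance, and older instances are down-weighted by `decay` each step."""
    weighted_sum, weight = 0.0, 0.0
    for x in stream:
        weighted_sum = decay * weighted_sum + x
        weight = decay * weight + 1.0
        yield weighted_sum / weight


class VarianceSynopsis:
    """Online sketch update (assumed form): a constant-size summary is
    maintained per instance (Welford's method), and the model value is
    derived from the summary on demand."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Population variance computed from the summary alone.
        return self.m2 / self.n if self.n else 0.0
```

With a decay factor below 1, `online_decayed_mean` leans toward recent instances, which is one simple way to let the importance of data fall off over time as the abstract describes.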
 

Language: Chinese
Content Type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/19041
Appears in Collections: 软件工程技术研究开发中心 (Software Engineering Technology Research and Development Center) _ Theses

Files in This Item:
File Name / File Size / Content Type / Version / Access / License
典型流式机器学习算法并行化设计与实现.pdf (3422KB) / Thesis / -- / Restricted access / Contact to obtain the full text

description.institution: Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)

Recommended Citation:
沈雯婷. 典型流式机器学习算法并行化设计与实现[D]. 北京. 中国科学院研究生院. 2018-05-22.