Parallel Design and Implementation of Typical Streaming Machine Learning Algorithms (典型流式机器学习算法并行化设计与实现)
沈雯婷
Major: Computer Software and Theory
Supervisor: 王伟
2018-05-22
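Degree Grantor: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Degree Level: Master's
Place of Degree Grantor: Beijing
Keyword: streaming machine learning; incremental learning; online learning; data stream mining; parallelization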
English Abstract

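Machine learning is a class of algorithms that automatically analyze known data to derive patterns and use those patterns to make predictions on unknown data; it is widely applied across many fields. In many current machine learning scenarios, the data is streaming data that is "high-speed", "dynamic", and "unbounded": the volume of training data is large and the data distribution depends on time. In a streaming environment, therefore, the model of a machine learning algorithm must be updated continuously while guaranteeing both real-time performance and accuracy, and traditional batch processing cannot meet these requirements.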
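Incremental learning and staged, incremental batch processing can meet the needs of machine learning in streaming environments. Streaming machine learning algorithms improve on these techniques so that the model is updated in real time and the importance of data decays over time, which improves accuracy. Some distributed machine learning frameworks provide parallel implementations of a number of streaming machine learning algorithms, but existing work has the following problems: (1) typical streaming machine learning algorithms lack parallelization schemes; (2) computation modes and data flow models for streaming machine learning are lacking; (3) the parallel streaming algorithms that have been implemented fall short in real-time performance and accuracy.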
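To address these shortcomings, this thesis (1) summarizes the characteristics of streaming machine learning algorithms; (2) divides streaming machine learning algorithms into three computation modes, namely micro-batch incremental update, online incremental update, and online sketch update, which correspond to streaming adaptations of batch processing, online learning, and data stream mining techniques, respectively; (3) builds mathematical models from the temporal logic of the computation steps in each computation mode, and expresses the dependencies between the inputs and outputs of the computation functions in those models as data flow models, comprising a parameter increment computation flow, a parameter update flow, and a model computation flow; (4) proposes partitioning schemes for the streaming data and the changing parameters and parallel implementation methods for each computation flow in the data flow models, and summarizes the design steps for making an algorithm streaming and parallel together with decision rules for choosing a parallelization method; (5) based on the three proposed computation modes, the data flow models, and the parallelization methods, designs and implements streaming, parallel versions of typical streaming machine learning algorithms on the Flink distributed stream processing framework.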
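The experimental results show that, with the three computation modes for streaming machine learning, the corresponding data flow models, and the parallel design methods proposed in this thesis, batch processing algorithms, online learning algorithms, and data stream mining algorithms can all be implemented simply as parallel streaming algorithms in a distributed environment; computation latency is on the order of hundreds of milliseconds, meeting the real-time target; throughput increases as computing nodes are added, so the algorithms are scalable; and the accuracy loss of the parallel implementations relative to the serial implementations stays within one order of magnitude.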
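As a rough illustration of the three computation flows named in this abstract (parameter increment computation, parameter update, and model computation), the following Python sketch runs one data-parallel step for a one-parameter linear model y ≈ w·x. It is not taken from the thesis, which targets Flink: the function names, the round-robin partitioning, the squared-error gradient, and the averaging of increments are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Instance = Tuple[float, float]  # (feature x, label y); hypothetical data layout


def partition(batch: Sequence[Instance], n_workers: int) -> List[List[Instance]]:
    """Split a micro-batch round-robin across workers (assumed partitioning scheme)."""
    return [list(batch[i::n_workers]) for i in range(n_workers)]


def parameter_increment(w: float, part: Sequence[Instance], lr: float) -> float:
    """Parameter increment flow: a worker derives an increment from its own partition."""
    if not part:
        return 0.0
    # Averaged gradient of the squared error (w * x - y) ** 2 over the partition.
    grad = sum(2.0 * (w * x - y) * x for x, y in part) / len(part)
    return -lr * grad


def parameter_update(w: float, increments: Sequence[float]) -> float:
    """Parameter update flow: merge the workers' increments into the shared parameter."""
    return w + sum(increments) / len(increments)


def model_output(w: float, x: float) -> float:
    """Model computation flow: evaluate the current model on an input."""
    return w * x


def parallel_step(w: float, batch: Sequence[Instance], n_workers: int, lr: float = 0.1) -> float:
    """One step: partition the data, compute increments, then update the parameter."""
    parts = partition(batch, n_workers)
    # In a distributed runtime these calls would run on separate workers;
    # here they run sequentially to keep the sketch self-contained.
    increments = [parameter_increment(w, p, lr) for p in parts]
    return parameter_update(w, increments)
```

Calling `parallel_step` repeatedly on successive micro-batches moves `w` toward the slope of the data, and `model_output` can be queried between steps. Averaging the increments is only one possible merge rule; a real system would pick the merge rule to suit the algorithm.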
 

Abstract

Machine learning algorithms automatically extract patterns from known data and use those patterns to make predictions on unseen data. In many current applications, data arrives as a high-speed, dynamic, and unbounded stream. Because the data volume is large and the data distribution changes over time, the model of a streaming machine learning algorithm must be updated continuously while maintaining both real-time performance and accuracy. Traditional batch processing algorithms cannot meet these requirements.
Incremental learning and periodic, incremental batch computation meet the memory-usage and model-updating needs of streaming environments. Streaming machine learning algorithms build on these techniques so that the model is updated in real time and the importance of data decays over time. Although some streaming machine learning algorithms have been implemented in parallel on several distributed frameworks, the following problems remain: (1) classical streaming machine learning algorithms lack parallelization schemes; (2) computation modes and data flow models for streaming machine learning are lacking; (3) the parallel streaming algorithms implemented so far perform poorly in terms of real-time responsiveness and accuracy.
To address these deficiencies, this thesis (1) summarizes the characteristics of streaming machine learning algorithms; (2) classifies streaming machine learning algorithms into three computation modes, namely mini-batch incremental update, online incremental update, and online sketch update, which correspond to streaming adaptations of batch processing, online learning, and one-pass data stream mining techniques, respectively; (3) defines mathematical models for the computation modes based on the temporal logic of their computation steps, and expresses the dependencies between the inputs and outputs of the computation functions in those models as data flow models; (4) proposes partitioning methods for the streaming data and the changing model parameters, proposes a parallel implementation method for each computing flow, and establishes design steps for the streaming and parallel implementation of algorithms; (5) guides the implementation of typical streaming machine learning algorithms based on the proposed computation modes, data flow models, and parallel design rules.
The experimental results show that, with the computation modes, data flow models, and parallel design rules proposed in this thesis, batch algorithms and streaming algorithms can be implemented in a parallel, streaming fashion in a distributed environment with little effort. Computation latency is on the order of hundreds of milliseconds, which meets the real-time target; throughput increases as computing nodes are added, which indicates that the algorithms scale; and the accuracy loss of the parallel implementations relative to the serial implementations stays within one order of magnitude.
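To make the three computation modes more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the thesis's implementation: the one-parameter model y ≈ w·x, the squared-error gradient step, the exponentially decayed mean used as the "sketch", and all names and default values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


def mini_batch_incremental_update(
    w: float, batch: Sequence[Tuple[float, float]], lr: float = 0.1
) -> float:
    """Mini-batch mode: one gradient step per small batch of (x, y) instances."""
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad


def online_incremental_update(w: float, x: float, y: float, lr: float = 0.1) -> float:
    """Online mode: one gradient step per arriving instance."""
    grad = 2.0 * (w * x - y) * x
    return w - lr * grad


@dataclass
class DecayedMean:
    """Online sketch mode: a constant-size summary of the stream.

    `decay` is an assumed forgetting factor in (0, 1]; smaller values make
    older observations lose importance faster.
    """

    decay: float = 0.9
    weighted_sum: float = 0.0
    weight: float = 0.0

    def update(self, x: float) -> float:
        # Down-weight everything seen so far, then add the new observation.
        self.weighted_sum = self.decay * self.weighted_sum + x
        self.weight = self.decay * self.weight + 1.0
        return self.weighted_sum / self.weight
```

The mini-batch variant trades latency for a smoother gradient, the online variant updates the model on every instance, and the sketch keeps only two numbers regardless of how long the stream runs.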
 

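Subject: Software Theory
Language: Chinese
Content Type: Thesis
URI: http://ir.iscas.ac.cn/handle/311060/19041
Collection: Technology Center of Software Engineering (软件工程技术研究开发中心)
Affiliation: Institute of Software, Chinese Academy of Sciences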
Recommended Citation
GB/T 7714
沈雯婷. 典型流式机器学习算法并行化设计与实现[D]. 北京. 中国科学院研究生院,2018.
Files in This Item:
File Name/Size: 典型流式机器学习算法并行化设计与实现.p (3422KB)
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.