ETL (Extraction-Transformation-Loading) is a key technology for exchanging and sharing information resources within and between enterprises. With the rapid growth of enterprise data volumes, improving data processing capacity and execution efficiency has become one of the hard problems ETL must solve. This paper proposes a buffer-based parallel framework for processing ETL data flows. The framework uses a buffer reuse technique based on component classification to reduce memory consumption and the number of data copies. It also uses a parallel scheduling and execution strategy for data processing flows that supports parallelism at multiple granularities: task, pipeline, and data processing. The method has been implemented and validated on the ONCE DQ platform.
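
The two ideas named above, buffer reuse driven by component classification and pipeline-level parallelism, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not give the actual classification scheme or scheduler, so the `reuses_buffer` flag, the two example components, and the thread-per-stage layout are hypothetical and are not the ONCE DQ implementation. Data-level parallelism within a stage is not shown.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the batch stream


class UpperCaseName:
    """Hypothetical row-by-row component: safe to update its input buffer in place."""
    reuses_buffer = True

    def process(self, batch):
        for row in batch:
            row["name"] = row["name"].upper()
        return batch  # the same list object is handed on, so no copy is made


class FilterAdults:
    """Hypothetical component that changes the row count, so it allocates a new buffer."""
    reuses_buffer = False

    def process(self, batch):
        return [row for row in batch if row["age"] >= 18]


def run_stage(component, inbox, outbox):
    """Run one component in its own thread, consuming and producing batches."""
    while True:
        batch = inbox.get()
        if batch is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        outbox.put(component.process(batch))


def run_pipeline(components, batches):
    """Connect components with bounded queues so that stages overlap in time."""
    queues = [queue.Queue(maxsize=4) for _ in range(len(components) + 1)]

    def feed():
        for batch in batches:
            queues[0].put(batch)
        queues[0].put(SENTINEL)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=run_stage, args=(c, queues[i], queues[i + 1]))
        for i, c in enumerate(components)
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        batch = queues[-1].get()
        if batch is SENTINEL:
            break
        results.extend(batch)
    for t in threads:
        t.join()
    return results
```

In this sketch, a component flagged `reuses_buffer = True` passes its input list downstream unchanged in identity, which avoids one allocation and one copy per batch; a component flagged `False` pays for a new buffer. Each stage has a single worker and FIFO queues, so batch order is preserved.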