Title:  高性能并行数值软件性能优化及存储复杂性研究 
Author:  张云泉

Issued Date:  2000

Major:  计算机软件与理论

Degree Grantor:  中国科学院中科院软件研究所

Place of Degree Grantor:  中科院软件研究所

Degree Level:  博士

Keyword:  并行数值软件包
; 性能优化
; 处理器网格
; 分块算法
; 应用程序编程接口
; 存储复杂性
; 存储层次
; 并行计算模型

Abstract:  高性能计算已经进入了万亿次机器的时代。围绕着在万亿次高性能计算环境下的并行数值软件包性能优化问题，本文以面向这类计算环境设计的有代表性的并行数值软件包ScaLAPACK为例，从最适处理器网格形状的自动选择，并行近优数据分块大小的自动选择，用户友好的ScaLAPACK应用程序编程接口，对同一算法的不同实现形式从存储复杂性角度的有效区分，和新的面向数值计算的有存储层次和指令级并行的并行计算模型等五个方面进行了研究，并提出了相应的性能优化技术和分析方法。本文在注意到处理器网格形状对各处理器上的计算量分配即负载平衡无影响的基础上，通过提出新的通信点扩展定义，提出了选择通信点集合度最小的处理器网格为性能最优的处理器网格的思想和具体方法。在日立并行计算机SR2201上对ScaLAPACK软件包中的标准性能测试程序并行LU和LLT的分析，证明了该方法完全可以正确的选择在特定处理器数目，问题规模和分块大小情况下的性能最优的处理器网格形状；在假定本地计算的最优分块大小范围可以通过实验确定的情况下，本文提出了一个把本地计算的最优分块范围大小的选择，负载平衡，通信性能优化和本地存储空间限制等因素统一在一起的近优分块大小选择框架。该框架通过给出不同因素对近优分块范围限制的不等式，在小规模实验对不同因素对近优分块选择影响匹配和分析的基础上，找出在具体平台上最关键的一个或多个因素，并通过拟合这些因素与小规模实验的结果，确定包含这些因素的近优分块选择的理论公式。通过这一框架，我们成功的在SR2201上给出了并行LU和QR算法的近优分块选择的理论公式．实验证明，在处理器数目较大时，我们给出的公式能够很好的选择近优的分块大小，一般的预测分块与实际最优分块的性能差别：对LU分解平均性能差别为2.98％；对QR分解来说，平均性能差别为7.73％．若问题规模较大时，误差更小。在前面的处理器网格形状选择技术和近优分块选择框架研究成果的基础上，结合当前正在成为研究热点的分布式共享系统和HPF语言优点的对比分析，通过对ScaLAPACK子程序调用时可通过自动选择技术确定的参数(最优处理器网格和近优分块大小)的隐藏和对矩阵数据与数据分配信息的封装，我们设计和实现了功能更适合于普通用户且调用界面友好的ScaLAPACK软件包应用程序编程接口SLAPI 1.0。目前，该软件包已完成了基本功能的设计和实现，正在进一步完善之中；怎样能够评价一个算法的具体实现存储行为的优劣，以便于定量的给出该实现形式存储行为的具体特点及可能的改进前景，成为一个急待解决的问题。本文在已有工作的基础上，提出了存储复杂性的具体定义，并给出了存储复杂性的评价尺度存储复杂度的定义和分析方法。在多个计算平台上的实验与我们的分析结果能够很好的吻合．在前面提出的存储复杂性概念基础上，观察到近几年并行计算模型的设计中，由于对存储层次的忽略而导致模型在分析涉及大量存储访问的程序分析时精确度不高的情况，本文进一步提出了面向高性能数值计算的并行计算模型DRAM(h,k)有存储层次[h]和指令级并行[k]的分布式RAM模型。在计算平台日立SR2201和PIII 500 Linux PC机群上对在该模型下分析结果的验证，证明该模型能够较好的对同一算法的不同并行实现形式进行相对性能排序． 
English Abstract:  With the more and more complicated architecture design of high performance computing systems, it becomes difficult things for common users to port their software and exploit potential performance on these systems. Thus the design and development of numerical software package became the research focus in the 70s of last 20th century. Until now, according to the targeted computing system architectures, these numerical software package can be classified into three generations: The first generation include Eispack and Linpack packages, from 1973 to 1985, which targeted to high performance vector computers; The second generation is the LAPACK package, from 1984 to 1992, which targeted to high performance vector computers, shared memory parallel computing systems and high performance computers with deep memory hierarchy; The third generation is the ScaLAPACK package, began on 1990, which targeted to distributed memory parallel computing system with message passing. With the maturation of distributed shared memory computing systems, the forth generation numerical software package will become the future research focus. However, in the current popular parallel numerical linear algebra package ScaLAPACK, the performance critical problem on the automatic selection of processor grid and block size are still unresolved. Thus when common user calls the ScaLAPACK subroutines, they are required to determine the performance suitable processor grid shape and block size. This would be a large burden for common users and thus obstacle the wide population and application of ScaLAPACK package. At the same time, there axe no much changes on the time and space complexity of those classic algorithm adopted in the three generations of numerical software packages, but after modification on the implementation of these algorithms, they can still exploit most performance from its target computing platforms. The reason is the developer of these packages adopted the implementation forms that is suitable to its target platform's memory system characteristics. The vector form for vector computers, the blocked forms algorithms for computers with memory hierarchy, and the processor grid and two dimensional block cycle data distribution forms, axe the demonstration of these considerations. But if we analyzed these different implementation forms under RAM computing model, we can find out that they are not distinguishable under this model. How to distinguish these different implementation forms of one same algorithm, becomes one urgent problem that needs to be resolved. The resolution on this problem will encourage the common user to design their algorithm implementation form that suitable for the memory hierarchy system characteristics from the just beginning of their development stage, thus will save the performance tuning cost a lot. In this dissertation, we discussed the optimization techniques and analysis methods on five aspects of ScaLAPACK performance optimization problems, i.e., automatic selection of optimal processor grid shape, the automatic selection of nearoptimal block size, user friendly application programming interface of ScaLAPACK, the memory complexity concept of different implementations of the same algorithm, and a new parallel computing model with memory hierarchy and instruction level parallelism  DRAM(h,k). In order to realize the automatic selection of optimal processor grid shape, we extended the communication point definitions [47] Based on the analysis of the communication point set in one parallel program and the observation that the change of processor grid shape doesn't affect the load balance in ScaLAPACK, we proposed the minimum degree of communication point set method that can determine the performance suitable processor grid shape for one parallel program. In this method, the parallel program with different processor grid will have different communication point set and the corresponding communication point weight, thus will have different final cost on communication. Through finding the communication point set with the minimum degree(cost), we can determine the processor grid shape with the best performance. The analysis on two ScaLAPACK parallel performance timing drivers, LU and LLT factorization, and corresponding experiments results on HITACHI SR2201, demonstrate that our method can correctly determine the optimal processor grid for specific number of processors, problem size and block size of one parallel program. The common method to select the optimal block size is based on the user's experience and large amount of experiments. However, the common user needs method that is simple and time saving, they have no such rich required experiences and are unwilling to pay for the large amount of experimental time. Since the optimal block size is dependent on the loop characteristics, memory hierarchy and their interactions, the optimal block size selection problem of LAPACK package is still unresolved. The user has to model the detailed behavior of memory access pattern and floating point pipeline in order to accurately find out the changing rules of optimal block size with the problem size. But this requirement is often unrealistic and difficult. However, this doesn't mean we can't solve the selection of optimal parallel data distribution block size problem. In the domain of parallel program, the optimal data distribution block size selection becomes the interaction among local acceptable block size range problem, load balance problem, communication optimization problem and memory size constraints, and so on. Based on the assumption that the local acceptable performance block size range can be determined through small experiments, we proposed a uniform framework that can unify all the problems mentioned. In this framework, we determine the possible range of near optimal block size through finding out all the unequals that satisfy all the requirement of mentioned problems. And then through performing small experiments on small number of processors, we can determine one or more performance critical constraints that must be satisfied. Then after curve fitting on these experimental results and unequals, we can finally give the theoretical formula for parallel near optimal data distribution block size selection. We successfully found out the theoretical selection formula for parallel LU and QR factorization of ScaLAPACK on SR2201 using our framework. The average performance difference between predicted and measured is 2.98% for LU factorization and 7.73% for QR factorization. For even larger problem size, the performance difference is even small. 
Language:  中文

Content Type:  学位论文

URI:  http://ir.iscas.ac.cn/handle/311060/6298

Appears in Collections:  中科院软件所

File Name/ File Size 
Content Type 
Version 
Access 
License 

LW002985.pdf（1170KB）      限制开放    联系获取全文 

Recommended Citation: 
张云泉. 高性能并行数值软件性能优化及存储复杂性研究[D]. 中科院软件研究所. 中国科学院中科院软件研究所. 20000101.


