Institutional Repository
| Optimizing and scaling HPCG on Tianhe-2: Early experience | |
| Zhang, Xianyi (1); Yang, Chao (1); Liu, Fangfang (1); Liu, Yiqun (1); Lu, Yutong (4) | |
| 2014 | |
| Conference Name | 14th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2014 |
| Pages | 28-41 |
| Conference Date | August 24, 2014 - August 27, 2014 |
| Conference Place | Dalian, China |
| Indexed Type | EI |
| Publisher | Springer Verlag |
| ISSN | 0302-9743 |
| ISBN | 9783319111964 |
| Department | (1) Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; (2) State Key Laboratory of Computer Science, Chinese Academy of Sciences, Beijing 100190, China; (3) University of Chinese Academy of Sciences, Beijing 100049, China; (4) National University of Defense Technology, Changsha Hunan 410073, China |
| English Abstract | In this paper, a first attempt has been made on optimizing and scaling HPCG on the world's largest supercomputer, Tianhe-2. This early work focuses on the optimization of the CPU code without using the Intel Xeon Phi coprocessors. In our work, we reformulate the basic CG algorithm to minimize the cost of collective communication and employ several optimizing techniques such as SIMDization, loop unrolling, forward and backward sweep fusion, and OpenMP parallelization to further enhance the performance of kernels such as the sparse matrix vector multiplication, the symmetric Gauss-Seidel relaxation and the geometric multigrid v-cycle. We successfully scale the HPCG code from 256 up to 6,144 nodes (147,456 CPU cores) on Tianhe-2, with a nearly ideal weak scalability and an aggregate performance of 79.83 Tflops, which is 6.38X higher than the reference implementation. © 2014 Springer International Publishing Switzerland. |
| Language | English |
| Content Type | Conference paper |
| URI | http://ir.iscas.ac.cn/handle/311060/16618 |
| Collection | Institute of Software, Chinese Academy of Sciences |
| Recommended Citation GB/T 7714 | Zhang, Xianyi, Yang, Chao, Liu, Fangfang, et al. Optimizing and scaling HPCG on Tianhe-2: Early experience[C]. Springer Verlag, 2014: 28-41. |
| Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.