Institutional Repository
| OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU | |
| Zhang Xian-Yi; Wang Qian; Zhang Yun-Quan | |
| 2011 | |
| Journal | Ruan Jian Xue Bao/Journal of Software |
| ISSN | 1000-9825 |
| Volume | 22, Issue: Suppl. 2, Pages: 208-216 |
| Abstract | BLAS is a fundamental math library in scientific computing, so each CPU vendor releases an optimized BLAS library for its own CPU. The Loongson CPU series is developed by the Institute of Computing Technology, Chinese Academy of Sciences, which released the Loongson 3 CPU series in 2010. This paper introduces the open source BLAS library OpenBLAS, which is forked from the GotoBLAS2-1.13 BSD version. The BLAS Level 3 functions of OpenBLAS are optimized for the Loongson 3A quad-core CPU. The sequential optimizations use blocking, hand-coded assembly kernels, Loongson 3A special instructions, and instruction reordering. The performance of the BLAS Level 3 subroutines exceeded GotoBLAS and ATLAS by about 75% and 17%, respectively, and by about 103% and 36% in double precision functions. For multi-threaded parallel optimization, this study used an interleaved data buffer layout to avoid conflicts among threads in the shared L2 cache. OpenBLAS achieved a speedup of 3.47 on four cores. With 4 threads, the performance of the OpenBLAS BLAS Level 3 functions exceeded GotoBLAS and ATLAS by about 69% and 34%, respectively, and by about 89% and 55% in double precision functions. ©2011 Journal of Software. |
| Indexed By | EI |
| Keywords | Computer Software; Software Engineering |
| Affiliation | (1) Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; (2) State Key Laboratory of Computing Science, Chinese Academy of Sciences, Beijing 100190, China; (3) Graduate University, Chinese Academy of Sciences, Beijing 100190, China |
| Language | Chinese |
| Content Type | Journal article |
| URI | http://ir.iscas.ac.cn/handle/311060/16164 |
| Collection | Institute of Software, Chinese Academy of Sciences |
| Recommended Citation GB/T 7714 | Zhang Xian-Yi, Wang Qian, Zhang Yun-Quan. OpenBLAS: a high performance BLAS library on Loongson 3A CPU[J]. Ruan Jian Xue Bao/Journal of Software, 2011, 22(Suppl. 2): 208-216. |
| APA | Zhang Xian-Yi, Wang Qian, & Zhang Yun-Quan. (2011). OpenBLAS: A high performance BLAS library on Loongson 3A CPU. Ruan Jian Xue Bao/Journal of Software, 22(Suppl. 2), 208-216. |
| MLA | Zhang Xian-Yi, et al. "OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU." Ruan Jian Xue Bao/Journal of Software 22.Suppl. 2 (2011): 208-216. |
| Files in This Item | There are no files associated with this item. |