ISCAS OpenIR
OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU
Zhang Xian-Yi; Wang Qian; Zhang Yun-Quan
2011
Journal: Ruan Jian Xue Bao/Journal of Software
ISSN: 1000-9825
Volume: 22; Issue: SUPPL. 2; Pages: 208-216
Abstract: BLAS is a fundamental math library in scientific computing, so each CPU vendor releases an optimized BLAS library for its own CPUs. The Loongson CPU series is developed by the Institute of Computing Technology, Chinese Academy of Sciences, which released the Loongson 3 series in 2010. This paper introduces OpenBLAS, an open source BLAS library forked from the GotoBLAS2 1.13 BSD version. The BLAS Level 3 functions of OpenBLAS are optimized for the quad-core Loongson 3A CPU. The sequential optimizations use blocking, hand-coded assembly kernels, Loongson 3A-specific instructions, and instruction reordering. The performance of the BLAS Level 3 subroutines exceeded that of GotoBLAS and ATLAS by about 75% and 17%, respectively, and by about 103% and 36% for the double precision functions. For multi-threaded parallel optimization, the study used an interleaved data buffer layout to avoid conflicts among threads in the shared L2 cache. OpenBLAS achieved a speedup of 3.47 on four cores. With 4 threads, the OpenBLAS BLAS Level 3 functions outperformed GotoBLAS and ATLAS by about 69% and 34%, respectively, and by about 89% and 55% for the double precision functions. ©2011 Journal of Software.
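The abstract names blocking as one of the sequential optimizations. As a rough illustration only, the sketch below shows generic cache blocking for a square matrix multiply in portable C. It is not the paper's kernel, which is hand-coded assembly using Loongson 3A instructions; the function names `gemm_naive` and `gemm_blocked` and the `block` parameter are hypothetical, and row-major square matrices are assumed for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Reference triple loop: C += A * B, all matrices n x n, row-major. */
void gemm_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}

static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Blocked variant: same arithmetic as gemm_naive, but the loops walk
 * over tiles of at most block x block elements so that the data touched
 * by the inner loops is small enough to be reused from cache.
 * The tile size is a tuning parameter (hypothetical here). */
void gemm_blocked(size_t n, size_t block,
                  const double *a, const double *b, double *c)
{
    assert(block > 0);
    for (size_t ii = 0; ii < n; ii += block) {
        size_t i_end = min_size(ii + block, n);
        for (size_t kk = 0; kk < n; kk += block) {
            size_t k_end = min_size(kk + block, n);
            for (size_t jj = 0; jj < n; jj += block) {
                size_t j_end = min_size(jj + block, n);
                /* Multiply one tile of A by one tile of B. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first. A production library would additionally pack tiles into contiguous buffers and replace the innermost loops with an architecture-specific kernel; those steps are left out of this sketch.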
Indexed by: EI
Keywords: Computer Software; Software Engineering
Affiliations: (1) Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; (2) State Key Laboratory of Computing Science, Chinese Academy of Sciences, Beijing 100190, China; (3) Graduate University, Chinese Academy of Sciences, Beijing 100190, China
Language: Chinese
Content type: Journal article
URI: http://ir.iscas.ac.cn/handle/311060/16164
Collection: Institute of Software, Chinese Academy of Sciences
Recommended citation
GB/T 7714: Zhang Xian-Yi, Wang Qian, Zhang Yun-Quan. OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU[J]. Ruan Jian Xue Bao/Journal of Software, 2011, 22(SUPPL. 2): 208-216.
APA: Zhang Xian-Yi, Wang Qian, & Zhang Yun-Quan. (2011). OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU. Ruan Jian Xue Bao/Journal of Software, 22(SUPPL. 2), 208-216.
MLA: Zhang Xian-Yi, et al. "OpenBLAS: A High Performance BLAS Library on Loongson 3A CPU." Ruan Jian Xue Bao/Journal of Software 22.SUPPL. 2 (2011): 208-216.