ISCAS OpenIR
Model-Driven Level 3 BLAS Performance Optimization on Loongson 3A Processor
Zhang Xianyi; Wang Qian; Zhang Yunquan
2012
Conference name: 18th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2012
Proceedings title: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Pages: 684-691
Conference date: December 17, 2012 - December 19, 2012
Conference location: Singapore, Singapore
Indexed by: EI
ISSN: 1521-9097
ISBN: 9780769549033
Affiliation: (1) Lab of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; (2) Graduate University of Chinese Academy of Sciences, Beijing 100190, China; (3) State Key Lab of Computing Science, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Every mainstream processor vendor provides an optimized BLAS implementation for its CPU, as BLAS is a fundamental math library in scientific computing. The Loongson 3A CPU is a general-purpose 64-bit MIPS64 quad-core processor, developed by the Institute of Computing Technology, Chinese Academy of Sciences. To date, there has not been a sufficiently optimized BLAS on the Loongson 3A CPU. The purpose of this research is to optimize level 3 BLAS performance on the Loongson 3A CPU. We analyzed the Loongson 3A architecture and built a performance model that identifies L1 data cache misses as the key bottleneck, which differs from level 3 BLAS optimization on mainstream x86 CPUs. Therefore, we employed a variety of methods to avoid L1 cache misses in single-thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory access extension instructions, software prefetching, and single-precision floating-point SIMD instructions. Furthermore, we improved parallel performance by reducing bank conflicts among multiple threads in the shared L2 cache. We created an open-source BLAS project, OpenBLAS, to demonstrate the performance improvement on the Loongson 3A quad-core processor. © 2012 IEEE.
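The cache and register blocking mentioned in the abstract can be illustrated with a minimal C sketch. It is not the OpenBLAS kernel from the paper: the tile edge `BLOCK`, the 2 x 2 micro-tile, and the function names `gemm_naive` and `gemm_blocked` are assumptions chosen for readability, and the Loongson 3A 128-bit memory access instructions, software prefetching, and SIMD are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge; a tuned kernel would size it to the L1 data cache. */
#define BLOCK 4

/* Reference triple loop: C += A * B for row-major n x n matrices. */
void gemm_naive(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Cache blocking: walk BLOCK x BLOCK tiles so the working set stays small.
 * Register blocking: inside a tile, update a 2 x 2 patch of C held in locals. */
void gemm_blocked(size_t n, const double *A, const double *B, double *C) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = min_sz(ii + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = min_sz(kk + BLOCK, n);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = min_sz(jj + BLOCK, n);
                for (size_t i = ii; i < i_end; i += 2) {
                    for (size_t j = jj; j < j_end; j += 2) {
                        if (i + 1 < i_end && j + 1 < j_end) {
                            /* Full 2 x 2 micro-tile: four accumulators in locals. */
                            double c00 = C[i * n + j];
                            double c01 = C[i * n + j + 1];
                            double c10 = C[(i + 1) * n + j];
                            double c11 = C[(i + 1) * n + j + 1];
                            for (size_t k = kk; k < k_end; k++) {
                                double a0 = A[i * n + k];
                                double a1 = A[(i + 1) * n + k];
                                double b0 = B[k * n + j];
                                double b1 = B[k * n + j + 1];
                                c00 += a0 * b0;
                                c01 += a0 * b1;
                                c10 += a1 * b0;
                                c11 += a1 * b1;
                            }
                            C[i * n + j] = c00;
                            C[i * n + j + 1] = c01;
                            C[(i + 1) * n + j] = c10;
                            C[(i + 1) * n + j + 1] = c11;
                        } else {
                            /* Ragged tile edge: fall back to scalar updates. */
                            for (size_t row = i; row < min_sz(i + 2, i_end); row++) {
                                for (size_t col = j; col < min_sz(j + 2, j_end); col++) {
                                    double acc = C[row * n + col];
                                    for (size_t k = kk; k < k_end; k++)
                                        acc += A[row * n + k] * B[k * n + col];
                                    C[row * n + col] = acc;
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```

Each element of A and B loaded in the micro-tile loop is reused for two multiply-adds, which is the point of register blocking; the outer tiling keeps those loads within a small region of memory. Both routines accumulate into C, so the caller is expected to initialize C beforehand.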
Keywords: Cache Memory; Computer Systems; Microprocessor Chips
Language: English
Content type: Conference paper
URI: http://ir.iscas.ac.cn/handle/311060/15910
Collection: Institute of Software, Chinese Academy of Sciences
Recommended citation
GB/T 7714
Zhang Xianyi, Wang Qian, Zhang Yunquan. Model-driven level 3 BLAS performance optimization on Loongson 3A processor[C], 2012: 684-691.