中国科学院软件研究所机构知识库
Advanced  
ISCAS OpenIR  > 软件所图书馆  > 会议论文
Title:
model-driven level 3 blas performance optimization on loongson 3a processor
Author: Zhang Xianyi ; Wang Qian ; Zhang Yunquan
Source: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Conference Name: 18th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2012
Conference Date: December 17, 2012 - December 19, 2012
Issued Date: 2012
Conference Place: Singapore, Singapore
Keyword: Cache memory ; Computer systems ; Microprocessor chips
Indexed Type: EI
ISSN: 1521-9097
ISBN: 9780769549033
Department: (1) Lab of Parallel Software and Computational Science Institute of Software Chinese Academy of Sciences Beijing 100190 China; (2) Graduate University of Chinese Academy of Sciences Beijing 100190 China; (3) State Key Lab of Computing Science Chinese Academy of Sciences Beijing 100190 China
Abstract: Every mainstream processor vendor provides an optimized BLAS implementation for its CPU, as BLAS is a fundamental math library in scientific computing. The Loongson 3A CPU is a general-purpose 64-bit MIPS64 quad-core processor, developed by the Institute of Computing Technology, Chinese Academy of Sciences. To date, there has not been a sufficiently optimized BLAS on the Loongson 3A CPU. The purpose of this research is to optimize level 3 BLAS performance on the Loongson 3A CPU. We analyzed the Loongson 3A architecture and built a performance model to highlight the key point, L1 data cache misses, which is different from level 3 BLAS optimization on the mainstream ×86 CPU. Therefore, we employed a variety of methods to avoid L1 cache misses in single thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory accessing extension instructions, software prefetching, and single precision floating-point SIMD instructions. Furthermore, we improved parallel performance by reducing bank conflicts among multiple threads in the shared L2 cache. We created an open source BLAS project, OpenBLAS, to demonstrate the performance improvement on the Loongson 3A quad-core processor. © 2012 IEEE.
English Abstract: Every mainstream processor vendor provides an optimized BLAS implementation for its CPU, as BLAS is a fundamental math library in scientific computing. The Loongson 3A CPU is a general-purpose 64-bit MIPS64 quad-core processor, developed by the Institute of Computing Technology, Chinese Academy of Sciences. To date, there has not been a sufficiently optimized BLAS on the Loongson 3A CPU. The purpose of this research is to optimize level 3 BLAS performance on the Loongson 3A CPU. We analyzed the Loongson 3A architecture and built a performance model to highlight the key point, L1 data cache misses, which is different from level 3 BLAS optimization on the mainstream ×86 CPU. Therefore, we employed a variety of methods to avoid L1 cache misses in single thread optimization, including cache and register blocking, the Loongson 3A 128-bit memory accessing extension instructions, software prefetching, and single precision floating-point SIMD instructions. Furthermore, we improved parallel performance by reducing bank conflicts among multiple threads in the shared L2 cache. We created an open source BLAS project, OpenBLAS, to demonstrate the performance improvement on the Loongson 3A quad-core processor. © 2012 IEEE.
Language: 英语
Content Type: 会议论文
URI: http://ir.iscas.ac.cn/handle/311060/15910
Appears in Collections:软件所图书馆_会议论文

Files in This Item:

There are no files associated with this item.


Recommended Citation:
Zhang Xianyi,Wang Qian,Zhang Yunquan. model-driven level 3 blas performance optimization on loongson 3a processor[C]. 见:18th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2012. Singapore, Singapore. December 17, 2012 - December 19, 2012.
Service
Recommend this item
Sava as my favorate item
Show this item's statistics
Export Endnote File
Google Scholar
Similar articles in Google Scholar
[Zhang Xianyi]'s Articles
[Wang Qian]'s Articles
[Zhang Yunquan]'s Articles
CSDL cross search
Similar articles in CSDL Cross Search
[Zhang Xianyi]‘s Articles
[Wang Qian]‘s Articles
[Zhang Yunquan]‘s Articles
Related Copyright Policies
Null
Social Bookmarking
Add to CiteULike Add to Connotea Add to Del.icio.us Add to Digg Add to Reddit
所有评论 (0)
暂无评论
 
评注功能仅针对注册用户开放,请您登录
您对该条目有什么异议,请填写以下表单,管理员会尽快联系您。
内 容:
Email:  *
单位:
验证码:   刷新
您在IR的使用过程中有什么好的想法或者建议可以反馈给我们。
标 题:
 *
内 容:
Email:  *
验证码:   刷新

Items in IR are protected by copyright, with all rights reserved, unless otherwise indicated.

 

 

Valid XHTML 1.0!
Copyright © 2007-2020  中国科学院软件研究所 - Feedback
Powered by CSpace