Institutional Repository of the Institute of Software, Chinese Academy of Sciences
Title:
StreamScan: Fast scan algorithms for GPUs without global barrier synchronization
Author: Yan, Shengen (1) ; Long, Guoping (1) ; Zhang, Yunquan (1)
Conference Name: 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013
Conference Date: February 23, 2013 - February 27, 2013
Issued Date: 2013
Conference Place: Shenzhen, China
Keyword: Scan ; prefix-sum ; OpenCL ; CUDA ; GPU ; Parallel algorithms
Corresponding Author: Yan, S.(yanshengen@gmail.com)
Publish Place: Association for Computing Machinery, General Post Office, P.O. Box 30777, NY 10087-0777, United States
Indexed Type: SCI ; EI
ISSN: 0362-1340
ISBN: 9781450319225
Department: (1) Lab. of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing, China; (2) State Key Laboratory of Computing Science, Chinese Academy of Sciences, Beijing, China; (3) Graduate University, Chinese Academy of Sciences, Beijing, China
Abstract: Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, and compaction. The current state-of-the-art GPU-based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implementing scan on GPUs with only one computation phase. The main idea is to restrict synchronization to adjacent workgroups only, and thereby eliminate global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on-chip problem size. We designed an auto-tuning framework that searches the parameter space automatically to generate highly optimized code for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatically different optimization tradeoffs between Nvidia and AMD GPU platforms. © 2013 ACM.
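The adjacent-workgroup synchronization described in the abstract can be illustrated with a small CPU-side sketch. This is not the authors' OpenCL code: the function name `chained_scan`, the `block_size` parameter, and the use of Python threads and `threading.Event` in place of GPU workgroups and global-memory flags are all assumptions made for illustration. Each block reads its slice once, waits only for its left neighbour's running total, publishes its own total, and then writes its results once, which gives roughly N reads and N writes overall.

```python
import threading
from itertools import accumulate


def chained_scan(data, block_size=4):
    """Inclusive prefix sum in which each block waits only on its left neighbour.

    Illustrative assumption: one Python thread stands in for one GPU workgroup.
    """
    n = len(data)
    num_blocks = (n + block_size - 1) // block_size
    out = [0] * n
    carry = [0] * num_blocks  # inclusive total published by each block
    ready = [threading.Event() for _ in range(num_blocks)]

    def worker(b):
        lo, hi = b * block_size, min((b + 1) * block_size, n)
        # Local scan of this block: the only read of the input slice.
        local = list(accumulate(data[lo:hi]))
        prefix = 0
        if b > 0:
            ready[b - 1].wait()      # synchronize with the adjacent block only
            prefix = carry[b - 1]
        carry[b] = prefix + local[-1]
        ready[b].set()               # release the right neighbour early
        # Final results: the only write of the output slice.
        for i, value in enumerate(local):
            out[lo + i] = prefix + value

    threads = [threading.Thread(target=worker, args=(b,)) for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(chained_scan([1, 2, 3, 4, 5, 6, 7, 8, 9], block_size=4))
```

No step in the sketch requires all blocks to finish before any block continues, which is the property a global barrier would impose. On a real GPU the same chaining additionally depends on earlier workgroups being scheduled before later ones and on memory fences around the published totals; the sketch does not model either.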
Language: English
Content Type: Conference paper
URI: http://ir.iscas.ac.cn/handle/311060/16554
Appears in Collections: Institute of Software Library_Conference Papers
Recommended Citation:
Yan, Shengen, Long, Guoping, Zhang, Yunquan. StreamScan: Fast scan algorithms for GPUs without global barrier synchronization[C]. In: 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013. Shenzhen, China. February 23, 2013 - February 27, 2013.