Institutional Repository
| YaSpMV: Yet another SpMV framework on GPUs; YaSpMV: Yet another SpMV framework on GPUs | |
| Yan, Shengen (1); Li, Chao (1); Zhang, Yunquan (2); Zhou, Huiyang (1) | |
| 2014 ; 2014 | |
| Conference Name | 2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014 ; 2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014 |
| Pages | 107-118 ; 107-118 |
| Conference Date | February 15, 2014 - February 19, 2014 ; February 15, 2014 - February 19, 2014 |
| Conference Place | Orlando, FL, United states ; Orlando, FL, United states |
| Indexed Type | EI ; EI |
| Publish Place | Association for Computing Machinery ; Association for Computing Machinery |
| ISBN | 9781450326568 ; 9781450326568 |
| Department | (1) Institute of Software, China North Carolina State University, Chinese Academy of Sciences, Raleigh, NC, United States; (2) State Key Lab of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China ; (1) Institute of Software, China North Carolina State University, Chinese Academy of Sciences, Raleigh, NC, United States; (2) State Key Lab of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China |
| English Abstract | SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs). Copyright © 2014 ACM.; SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs). Copyright © 2014 ACM.; SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs). Copyright © 2014 ACM.; SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs). Copyright © 2014 ACM. |
| Language | 英语 ; 英语 |
| WOS ID | WOS:000349142100010 ; WOS:000349142100010 |
| Citation statistics | |
| Content Type | 会议论文 |
| URI | http://ir.iscas.ac.cn/handle/311060/16601 |
| Collection | 中国科学院软件研究所 |
| Recommended Citation GB/T 7714 | Yan, Shengen ,Li, Chao ,Zhang, Yunquan ,et al. YaSpMV: Yet another SpMV framework on GPUs, YaSpMV: Yet another SpMV framework on GPUs[C]. Association for Computing Machinery, Association for Computing Machinery,2014, 2014:107-118, 107-118. |
| Files in This Item: | There are no files associated with this item. | |||||
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.
Edit Comment