English Abstract: |
The Fast Fourier Transform (FFT) is widely used in numerical computing, computer graphics, deep learning, and other fields. Because FFT often accounts for a relatively large share of an application's running time, a high-performance FFT library can significantly reduce overall computation time. Along with the rapid development of China’s domestic processors, the associated software ecosystem faces serious problems: little supporting software and inadequate adaptation. Developing essential, high-performance software for China’s domestic processors is therefore important for achieving independent control of software and hardware. Existing open-source FFT libraries cannot fully utilize the hardware performance of the Sunway multi-core processor and suffer from long tuning times and unstable computational performance. In addition, a complete, multi-level adaptive tuning framework and a corresponding FFT library are lacking.
This paper designs and implements a high-performance library supporting power-of-2 FFT computation on Sunway multi-core processors. The library includes an adaptive tuning framework based on a computational graph model, which models the FFT decomposition problem using hardware parameters to achieve fast tuning, and a load-balanced parallel task partitioning algorithm for processors with a symmetric multi-processing architecture. We further optimize the library with automatic code generation, vectorization, data reordering, and other techniques. On the Sunway 3231 platform, the library achieves an average speedup of 1.94 times and a maximum of 2.71 times over the open-source library FFTW; with 32 threads, it achieves an average speedup of 5.49 times and a maximum of 39.75 times over FFTW; and the adaptive tuning framework achieves an average speedup of 1030 times over FFTW. |