I am interested in CUDA DFT implementations for sizes that are not powers of 2, such as a 1200-point DFT and a 12-point DFT. CUFFT does cover these sizes, but the performance is not entirely satisfactory to me. For example, a 1200-point DFT takes more than 3 times as long as a 2048-point FFT. By profiling, I noticed that the 1200-point CUFFT launches 5 kernels: radix-2, radix-4, radix-5, radix-5, and radix-6 (2 × 4 × 5 × 5 × 6 = 1200). Can someone point me to some fast CUDA DFT implementations? If you believe the CUFFT performance for 1200 points is reasonable, an explanation of why would also be appreciated. Thanks
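P.S. To make the stage breakdown concrete, here is a minimal sketch of how a 1200-point DFT can be composed from radix-2, 4, 5, 5, 6 stages. It assumes a plain decimation-in-time mixed-radix scheme; I do not know that this is what CUFFT does internally, only that the radices match what the profiler shows. It is in Python for readability rather than speed, and `naive_dft` and `mixed_radix_fft` are names I made up for illustration.

```python
import cmath


def naive_dft(x):
    """Direct O(n^2) DFT, used only as a reference for checking results."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def mixed_radix_fft(x, radices):
    """Decimation-in-time DFT of x, one stage per entry in `radices`.

    The product of `radices` must equal len(x). Each recursion level
    corresponds to one stage (one kernel pass in the profile, by assumption).
    """
    n = len(x)
    if not radices:
        assert n == 1, "product of radices must equal the input length"
        return list(x)

    r = radices[0]
    assert n % r == 0, "radix must divide the current length"
    m = n // r

    # Split into r interleaved sub-sequences of length m and transform each.
    subs = [mixed_radix_fft(x[q::r], radices[1:]) for q in range(r)]

    # Combine: X[k] = sum_q W_n^(q*k) * S_q[k mod m], with W_n = exp(-2*pi*i/n).
    out = [0j] * n
    for k in range(n):
        acc = 0j
        for q in range(r):
            acc += cmath.exp(-2j * cmath.pi * q * k / n) * subs[q][k % m]
        out[k] = acc
    return out


if __name__ == "__main__":
    # 12-point case, factored as 3 * 4, checked against the direct DFT.
    x12 = [complex(i, -i) for i in range(12)]
    err12 = max(abs(a - b) for a, b in zip(mixed_radix_fft(x12, [3, 4]), naive_dft(x12)))
    print("12-point max error:", err12)

    # 1200-point case, using the radices seen in the profile.
    x1200 = [complex(i % 7, i % 3) for i in range(1200)]
    y = mixed_radix_fft(x1200, [2, 4, 5, 5, 6])
    print("1200-point output length:", len(y))
```

In this scheme every stage is a full pass over all 1200 points, so five stages means five sweeps through the data. My guess is that this is part of the cost I am seeing, but I have not confirmed it.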