Hi, I’d like to share an implementation of the FFT that achieves 160 Gflop/s on the GeForce 8800 GTX, which is 3x faster than the ~50 Gflop/s offered by CUFFT. It is designed for n = 512, which is hardcoded. The implementation also includes the n = 8 and n = 64 cases, which work in a special data layout.
Update (Sep 8, 2008): I attached a newer version that includes n = 16, n = 256 and n = 1024 cases. Here is its performance in Gflop/s when using driver 177.11 (up to ~40% slower with the newer 177.84 version):
Update (Jan 12, 2009): I attached a quickly patched version that supports forward and inverse FFTs for n = 8, 16, 64, 256, 512, 1024, 2048, 4096 and 8192. Here are results for CUDA 2.1 beta, driver 180.60:
Some of the layouts used are not straightforward. If you need filtering based on element-wise multiplication by a vector in frequency space, I’d suggest using a similar layout for the vector.
Attachments: IFFT_011209.zip (19.5 KB), FFT_090808.zip (11 KB), FFT_061408.zip (7.43 KB)
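For reference, FFT throughput is conventionally quoted assuming 5 n log2 n floating-point operations per transform, so a batched timing converts to Gflop/s roughly as follows (a small helper sketch, not part of the attachments):

#include <math.h>

// Customary FFT flop-count convention: 5 * n * log2(n) flops per transform.
double fft_gflops( int n, int batch, double seconds )
{
    double flops = 5.0 * (double)n * log2( (double)n ) * (double)batch;
    return flops / seconds / 1e9;
}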
Thanks for looking into the code! I tried your technique but didn’t get any difference. According to decuda, both choices produce nearly identical machine code; the only differences are that a couple of instructions were reordered and a few registers were renamed. It seems that __sincosf is compiled into the same instructions as a pair of __cosf and __sinf calls.
Vasily, I was surprised to hear it didn’t make a difference: I have some rotation code that got a nice benefit from using sincosf versus sinf and cosf. Thanks to both of you for the explanation; it turns out I use the full-precision calls.
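For concreteness, the variants under discussion look like this (an illustrative fragment, not code from the attachments; angle is a placeholder value):

__device__ void twiddle_variants( float angle, float2 *w )
{
    // Fast-math pair: per decuda this compiles to the same instructions as __sincosf.
    w->x = __cosf( angle );
    w->y = __sinf( angle );

    // Fast-math combined intrinsic.
    float s, c;
    __sincosf( angle, &s, &c );

    // Full-precision call; this is the variant that benefits from being combined,
    // since separate sinf()/cosf() each repeat the expensive argument reduction.
    sincosf( angle, &s, &c );
    w->x = c;
    w->y = s;
}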
I have a couple of questions. I’m kind of a beginner, so it might not be very interesting…
I’ve been working on DCTs recently, which is why I’m interested in your code.
__global__ void FFT512_device( float2 *work )
{
    work += (blockIdx.y * gridDim.x + blockIdx.x) * 512;

    int tid = threadIdx.x;
    int hi = tid>>3;
    int lo = tid&7;

    float2 a[8];
    __shared__ float smem[8*8*9];

    load( a, work + tid, 64 );

    FFT8( a );

    twiddle( a, tid, 512 );
    tranpose( a, &smem[hi*8+lo], 66, &smem[lo*66+hi], 8 );

    FFT8( a );

    twiddle( a, hi, 64 );
    tranpose( a, &smem[hi*8+lo], 8*9, &smem[hi*8*9+lo], 8, 0xE );

    FFT8( a );

    store( a, work + tid, 64 );
}
Why don’t you put the “float2 a[8]” in shared mem? Is it because of the shared mem size limitation?
Why don’t you synchronize threads before calling FFT8? Or after, or even between twiddle and transpose? I don’t really understand why it should work.
I have tried to launch your code on CUDA 1.1; I replaced double with float, because doubles aren’t available in CUDA 1.1.
Because shared memory is substantially slower than registers.
Synchronization is needed to coordinate access to shared memory. All shared memory accesses are done in transpose(), which performs all the synchronization that is required.
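The idea is roughly the following (a simplified sketch, not the transpose from the attachments; the name exchange8 and the strides are made up): each thread writes its eight per-thread values into shared memory, a barrier is issued, and the values are read back with a different stride, so all the __syncthreads() calls that are needed live inside this one routine.

__device__ void exchange8( float2 *a, float *s_wr, int write_stride,
                           float *s_rd, int read_stride )
{
    // write the real parts of the 8 per-thread values into shared memory
    for( int i = 0; i < 8; i++ )
        s_wr[i*write_stride] = a[i].x;

    __syncthreads();                    // everyone has finished writing

    // read them back with a different stride, effecting the transpose
    for( int i = 0; i < 8; i++ )
        a[i].x = s_rd[i*read_stride];

    __syncthreads();                    // safe to reuse the buffer (e.g. for the .y parts)
}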
I was using CUDA 2.0 when writing this code. Later I found that it runs much slower if compiled with CUDA 1.1. I could try to re-optimize the code for this earlier version of the compiler, but I don’t really see the point.
If you still want to call it from application code running on CUDA 1.1, you can compile this code into a .cubin using CUDA 2.0 and then call it from the CUDA 1.1 code through the driver API.
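A minimal sketch of that driver-API path (illustrative only; it assumes a context has already been created, and the FFT512.cubin file name and the unmangled kernel name are assumptions):

#include <cuda.h>

void launch_fft512( CUdeviceptr d_work, int n_batches )
{
    CUmodule   module;
    CUfunction kernel;

    // load the cubin built with the CUDA 2.0 toolchain
    cuModuleLoad( &module, "FFT512.cubin" );
    cuModuleGetFunction( &kernel, module, "FFT512_device" );  // may be a mangled name in practice

    cuFuncSetBlockShape( kernel, 64, 1, 1 );                  // 64 threads, 8 values each = 512 points
    cuParamSetv( kernel, 0, &d_work, sizeof(d_work) );
    cuParamSetSize( kernel, sizeof(d_work) );

    cuLaunchGrid( kernel, n_batches, 1 );                     // one block per 512-point transform
}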
I don’t think so. All local variables are placed in registers unless spilled to local memory. FFT512_device() doesn’t spill if compiled in CUDA 2.0 but spills in CUDA 1.1. FFT8_device() contains a similar declaration that does not spill in either case.
Then maybe you have constant indexing? Any array declared inside a global function is put in local memory unless it is indexed only with compile-time constants. (I looked briefly at your code yesterday evening, and I think you are indeed using constant indexing, because your loops are unrolled, so in your case it will indeed be put in registers.) This is regardless of registers spilling over to local memory.
As an example, compile a global function that calls sincosf(). You will see that local memory is used for that kernel, even when it uses fewer than 10 registers, just because an index cannot be determined at compile time. (It is used in a code path that might not get triggered at runtime, so you might never see local loads and stores in the profiler output.)
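A small illustration of the distinction (made-up kernels, not from the FFT code):

// Constant indexing: the loops are unrolled, every a[i] index is known at
// compile time, so the array can be kept entirely in registers.
__global__ void regs_demo( float *out )
{
    float a[8];
    #pragma unroll
    for( int i = 0; i < 8; i++ )
        a[i] = out[threadIdx.x + i * blockDim.x];
    float s = 0.0f;
    #pragma unroll
    for( int i = 0; i < 8; i++ )
        s += a[i];
    out[threadIdx.x] = s;
}

// Dynamic indexing: k is only known at run time, so the compiler places a[]
// in local memory even though few registers are in use.
__global__ void lmem_demo( float *out, int k )
{
    float a[8];
    #pragma unroll
    for( int i = 0; i < 8; i++ )
        a[i] = i * 2.0f;
    out[threadIdx.x] = a[k & 7];
}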
Sorry, I was mistaken: I was looking at the shared memory allocation and expected it to be double the size, but I realized the algorithm works somewhat differently.
I have another question: is it easy to modify the algorithm to produce an FFT of size 1024?
It is fairly easy to accomplish. I produced such an algorithm by upgrading the n = 512 case to radix-16. Of course, 16×16×16 > 1024, so one of the stages (the middle one) has a lower radix of 4, i.e. it computes four independent radix-4 transforms instead. The higher radix requires more registers, but I did manage to fit into 40 so that 3 thread blocks can run at a time per multiprocessor. It still fits into shared memory due to the particular shared memory allocation that you already found. I put a fair effort into tuning it but ultimately failed to get a higher Gflop/s rate than in the n = 512 case, so I didn’t post it here.
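For what it’s worth, a hypothetical outline of how such a 16×4×16 kernel could be laid out, mirroring FFT512_device above (the FFT16/FFT4 helpers and the omitted twiddle/transpose strides are assumptions, not Vasily’s actual code):

// Hypothetical: 16 x 4 x 16 = 1024, with 64 threads holding 16 values each.
__global__ void FFT1024_device( float2 *work )
{
    work += (blockIdx.y * gridDim.x + blockIdx.x) * 1024;
    int tid = threadIdx.x;

    float2 a[16];
    load( a, work + tid, 64 );     // 16 values per thread, stride 64

    FFT16( a );                    // first radix-16 stage
    // twiddle + transpose through shared memory (strides omitted)

    FFT4( a + 0 );                 // middle stage: four independent
    FFT4( a + 4 );                 // radix-4 transforms per thread,
    FFT4( a + 8 );                 // i.e. an effective radix of 4
    FFT4( a + 12 );
    // twiddle + transpose through shared memory (strides omitted)

    FFT16( a );                    // last radix-16 stage

    store( a, work + tid, 64 );
}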
Vasily, I’ve read some of your papers, and noticed that some of your algorithm improvements have been included in CUBLAS 2.0; are you going to be adding the improvements from this work into the next release of CUFFT?