Kernels that modify a 1D FFT: problem with the non-power-of-two output size


Unfortunately, the memory required for an FFT result is not a power of two, because the library requires one additional cufftComplex.

I have had no problems so far, but now I need to modify the FFT result with a kernel. How do you deal with this in general?

For FFTs larger than 512, because the block size is limited to 512 on my 8600 GT, the only option I see is a block size of (1,1,1) and a grid size of (n,1,1).

I fear that this is terrible for performance because I can't use a good block size.

How do you deal with such a thing in general? It would be nice if the result of an FFT had a power-of-two size…


It's not that bad. You just make your block size 256 x 1 x 1 and your grid size ((n+255)/256) x 1 x 1, and put a simple if (x < n) in your kernel, where x is the thread's global index. The extra complex means you will normally end up launching one more block than you would otherwise need, though.
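To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (so it runs without a GPU). The names gridSizeFor and scaleGuarded are made up for illustration, and the two loops only emulate what the blocks and threads of a real kernel would do; in CUDA the index x would come from blockIdx.x * blockDim.x + threadIdx.x.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed to cover n elements (integer ceiling division).
// With blockSize = 256 this is the (n + 255) / 256 from the post above.
std::size_t gridSizeFor(std::size_t n, std::size_t blockSize) {
    assert(blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Host-side emulation of a guarded launch (hypothetical helper, not a CUDA API).
// Every (block, thread) pair computes a global index x and only touches the
// data if x < n. Returns how many launched threads were skipped by the guard.
std::size_t scaleGuarded(std::vector<float>& data, float factor,
                         std::size_t blockSize) {
    const std::size_t n = data.size();
    const std::size_t blocks = gridSizeFor(n, blockSize);
    std::size_t idle = 0;
    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < blockSize; ++t) {
            // In a kernel: blockIdx.x * blockDim.x + threadIdx.x
            const std::size_t x = b * blockSize + t;
            if (x < n) {
                data[x] *= factor;  // in range: do the work
            } else {
                ++idle;             // out of range: the guard skips it
            }
        }
    }
    return idle;
}
```

For example, 513 elements with a block size of 256 need 3 blocks instead of 2, and 255 of the 768 launched threads do nothing.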

However, I agree it would be much nicer if CUFFT implemented proper real to half-complex transforms without needing an extra complex (as in Numerical Recipes in C or Intel Performance Primitives). Please, NVIDIA!?

Actually, there is more bad news. If you want to batch 1D transforms, then CUFFT insists on making the pitch/stride 2 * (N/2 + 1) floats, which is guaranteed to screw you up from a global memory coalescing point of view! Am I missing something, or is that a serious problem?
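A quick way to see the alignment problem is to compute the row stride in bytes and check it against an alignment boundary. This is only a sketch: rowStrideFloats and rowsAligned are hypothetical helper names, and the 64-byte boundary used in the example is an assumption for illustration, not a figure taken from the CUFFT documentation.

```cpp
#include <cassert>
#include <cstddef>

// Floats per row for a batched real-to-complex transform of length n,
// using the stride quoted in the post above: 2 * (n/2 + 1).
std::size_t rowStrideFloats(std::size_t n) {
    return 2 * (n / 2 + 1);
}

// True if every row of the batch starts on a multiple of alignBytes,
// assuming the base pointer itself is aligned to alignBytes.
bool rowsAligned(std::size_t n, std::size_t alignBytes) {
    assert(alignBytes > 0);
    return (rowStrideFloats(n) * sizeof(float)) % alignBytes == 0;
}
```

With N = 1024 the stride is 1026 floats, i.e. 4104 bytes with 4-byte floats, which is not a multiple of 64, so every row after the first starts off the boundary.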

I would like to second that. The memory alignment is a real issue for me in batching. Perhaps moving the extra alias term (the last of the N/2+1 outputs) into the imaginary part of the DC term (index 0) would work without any loss of information, since for real input of even length the imaginary parts of both should be zero. Another possibility is to add an option that discards the last alias term (e.g. CUFFT_R2C_NO_ALIAS).
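The packing idea can be sketched on the host in plain C++. This is not something CUFFT offers; packHalfComplex and unpackHalfComplex are hypothetical names, and the sketch assumes an even transform length with purely real DC and last terms, as described above.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<float>;

// Pack the n/2 + 1 outputs of a real-to-complex transform (n even) into n/2
// values by storing the real part of the last term in the imaginary part of
// the DC term. Assumes both terms have zero imaginary part.
std::vector<Complex> packHalfComplex(const std::vector<Complex>& spectrum) {
    assert(spectrum.size() >= 2);
    const std::size_t half = spectrum.size() - 1;  // n/2
    std::vector<Complex> packed(
        spectrum.begin(),
        spectrum.begin() + static_cast<std::ptrdiff_t>(half));
    packed[0] = Complex(spectrum[0].real(), spectrum[half].real());
    return packed;
}

// Inverse of packHalfComplex: restore the n/2 + 1 element layout.
std::vector<Complex> unpackHalfComplex(const std::vector<Complex>& packed) {
    assert(!packed.empty());
    std::vector<Complex> spectrum(packed.begin(), packed.end());
    spectrum.push_back(Complex(packed[0].imag(), 0.0f));  // last term
    spectrum[0] = Complex(packed[0].real(), 0.0f);        // DC term
    return spectrum;
}
```

The packed layout holds exactly N/2 complex values, i.e. N floats per row, so a power-of-two N would give a power-of-two stride again.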