Streams and CUFFT

I think I cannot do this, but I wanted to confirm:

I wanted to call cufftExecC2C (or any CUFFT routine, really) within
different streams. I think the cufft calls are not callable from within
kernel routines, and I think that means I am out of luck. I had a simple
kernel defined like:

__global__ void
fftKernel(cufftHandle fftPlan, cufftComplex *d_fftArrayA, cufftComplex *d_fftArrayB)
{
    // now call our fft
    CUFFT_SAFE_CALL(cufftExecC2C(fftPlan, d_fftArrayA, d_fftArrayB, CUFFT_FORWARD));
}

Then I wanted to call that as in:

fftKernel<<<nblocks, nthreads, streamArr[i]>>>(fftPlan[i], fftDeviceArrayA[i], fftDeviceArrayB[i]);

I get nvcc compile errors for this; I believe it is complaining about the call
to the cufft routine from inside a kernel. I recall reading somewhere that
cufft calls are essentially kernel launches in and of themselves.

So - is there a way to call a cufft routine within a stream? Thoughts on this would
be appreciated, thanks.

I’d like to know if CUFFT works in a stream as well.

I am not sure it will work with streams; you could try modifying the source code. Also remember that to use the async copy calls, the host data needs to be in pinned (page-locked) memory.
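For anyone reading this later: a minimal sketch of how this could look, assuming a CUFFT release that provides cufftSetStream (the version discussed in this thread may not have it). The helper name runStreamedFfts and the all-ones test input are made up for illustration. The host buffers are allocated with cudaMallocHost so that the async copies have pinned memory to work with, and cufftExecC2C is called from the host, not from inside a kernel.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper (not from this thread): one forward C2C FFT of
// length n per stream. Returns 0 on success, -1 on the first error.
// For brevity, resources are not released on the error paths.
int runStreamedFfts(int nStreams, int n)
{
    std::vector<cudaStream_t> streams(nStreams);
    std::vector<cufftHandle> plans(nStreams);
    std::vector<cufftComplex*> hostBuf(nStreams), devBuf(nStreams);
    const size_t bytes = sizeof(cufftComplex) * n;

    for (int i = 0; i < nStreams; ++i) {
        if (cudaStreamCreate(&streams[i]) != cudaSuccess) return -1;
        // Pinned (page-locked) host memory for the async copies.
        if (cudaMallocHost((void**)&hostBuf[i], bytes) != cudaSuccess) return -1;
        if (cudaMalloc((void**)&devBuf[i], bytes) != cudaSuccess) return -1;
        for (int k = 0; k < n; ++k) {
            hostBuf[i][k].x = 1.0f;   // all-ones input: bin 0 should come out as n
            hostBuf[i][k].y = 0.0f;
        }
        // One plan per stream; cufftSetStream ties the plan's work to that stream.
        if (cufftPlan1d(&plans[i], n, CUFFT_C2C, 1) != CUFFT_SUCCESS) return -1;
        if (cufftSetStream(plans[i], streams[i]) != CUFFT_SUCCESS) return -1;
    }

    for (int i = 0; i < nStreams; ++i) {
        // Copy errors are not checked here; they surface at the synchronize below.
        cudaMemcpyAsync(devBuf[i], hostBuf[i], bytes, cudaMemcpyHostToDevice, streams[i]);
        // Host-side call, in-place transform, queued in streams[i].
        if (cufftExecC2C(plans[i], devBuf[i], devBuf[i], CUFFT_FORWARD) != CUFFT_SUCCESS) return -1;
        cudaMemcpyAsync(hostBuf[i], devBuf[i], bytes, cudaMemcpyDeviceToHost, streams[i]);
    }
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    // CUFFT transforms are unnormalized, so bin 0 of an all-ones input is n.
    int ok = 1;
    for (int i = 0; i < nStreams; ++i)
        if (std::fabs(hostBuf[i][0].x - (float)n) > 1e-3f * n) ok = 0;

    for (int i = 0; i < nStreams; ++i) {
        cufftDestroy(plans[i]);
        cudaFree(devBuf[i]);
        cudaFreeHost(hostBuf[i]);
        cudaStreamDestroy(streams[i]);
    }
    return ok ? 0 : -1;
}
```

Calling runStreamedFfts(4, 1024) from main and linking with -lcufft should be enough to try it. Whether the streams actually overlap depends on the hardware and the transform size.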

By the way, your call is wrong:

fftKernel<<<nblocks, nthreads, streamArr[i]>>>(fftPlan[i], fftDeviceArrayA[i], fftDeviceArrayB[i]);

should be
fftKernel<<<nblocks, nthreads, 0, streamArr[i]>>>(fftPlan[i], fftDeviceArrayA[i], fftDeviceArrayB[i]);

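You are missing the shared memory size. The third launch parameter is the dynamic shared memory size in bytes; the stream goes in the fourth.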
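To make the launch syntax concrete, here is a small sketch with an ordinary kernel. scaleKernel and launchInStream are made-up names for illustration only; the point is just the order of the launch parameters: grid size, block size, dynamic shared memory bytes, then the stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, only here so there is something to launch.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

// Hypothetical helper: launches scaleKernel in a non-default stream.
// Returns 0 on success, -1 on error.
int launchInStream(int n)
{
    cudaStream_t stream;
    float *d_data = 0;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&d_data, n * sizeof(float)) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return -1;
    }
    cudaMemsetAsync(d_data, 0, n * sizeof(float), stream);

    int nthreads = 256;
    int nblocks = (n + nthreads - 1) / nthreads;

    // <<<grid, block, dynamic shared memory in bytes, stream>>>
    scaleKernel<<<nblocks, nthreads, 0, stream>>>(d_data, n, 2.0f);

    // Launch errors show up in cudaGetLastError, execution errors at the sync.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    cudaFree(d_data);
    cudaStreamDestroy(stream);
    return err == cudaSuccess ? 0 : -1;
}
```

Note this only fixes the launch syntax. The fftKernel from the first post still will not compile, because cufftExecC2C is a host function.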

Thanks for the info, Massimiliano.

Has anyone got cufft running in a stream yet? Also, when I run cufft through the profiler, I see kernels with _mpsm and _mpgm suffixes. I assume these have to do with shared mem and global mem access? I don’t see them in the CUFFT source code release… Do we have the complete source code, so that we could get the same performance as the host-callable CUFFT routines?

Thank you,


I think CUFFT works with streams if you use batched FFTs.

I thought batched ffts only change the grid dimensions of the cufft kernel launches. Have you had success using batched ffts and launching cufft on a stream?
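In case it helps, this is roughly what a batched transform looks like, using the batch argument of cufftPlan1d (newer CUFFT documentation points to cufftPlanMany for this instead). It is only a sketch: runBatchedFft is a made-up name, the input is all ones so the result is easy to check, and it does not by itself show whether batching has anything to do with streams.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: runs `batch` forward C2C FFTs of length n with a
// single plan and a single exec call. Returns 0 on success, -1 on error.
// For brevity, resources are not released on the error paths.
int runBatchedFft(int n, int batch)
{
    const size_t count = (size_t)n * batch;
    const size_t bytes = sizeof(cufftComplex) * count;

    // The batch is stored contiguously: signal b starts at element b * n.
    std::vector<cufftComplex> host(count);
    for (size_t k = 0; k < count; ++k) {
        host[k].x = 1.0f;
        host[k].y = 0.0f;
    }

    cufftComplex *d_data = 0;
    if (cudaMalloc((void**)&d_data, bytes) != cudaSuccess) return -1;
    if (cudaMemcpy(d_data, &host[0], bytes, cudaMemcpyHostToDevice) != cudaSuccess) return -1;

    cufftHandle plan;
    // Last argument is the number of transforms of size n.
    if (cufftPlan1d(&plan, n, CUFFT_C2C, batch) != CUFFT_SUCCESS) return -1;
    // One host-side call transforms the whole batch in place.
    if (cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) != CUFFT_SUCCESS) return -1;
    if (cudaMemcpy(&host[0], d_data, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return -1;

    // Unnormalized transform of all ones: bin 0 of every signal should be n.
    int ok = 1;
    for (int b = 0; b < batch; ++b)
        if (std::fabs(host[(size_t)b * n].x - (float)n) > 1e-3f * n) ok = 0;

    cufftDestroy(plan);
    cudaFree(d_data);
    return ok ? 0 : -1;
}
```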

I have noticed that not all the CUFFT code is provided to us, so it will take some time to get it to work with streams if we have to modify the source code. I have posted a request to NVIDIA to see whether they have any advice or (preferably) make the entire CUFFT library source code available to us. No answer either way yet … :(


I thought I did, but I was confused about streaming. Does anyone know if someone else has written an FFT routine for CUDA?

Does anybody know where to download the source code for CUFFT?