Is it possible to run cuFFT in a loop?

I have a C program that has a 4096 point 2D FFT which is looped 3096 times. In the equivalent CUDA version, I am able to compute the 2D FFT only once.

What is the procedure for calling an FFT inside a kernel? Is it possible?
The CUDA SDK does not have any examples of this type of calculation.

I am also not sure if a batch 2D FFT can be done for solving this problem. The cuFFT documentation does not have any batch FFT examples.

Has anyone run cuFFT in a loop? Any thoughts or suggestions?

So batch mode for a 2D FFT is not supported in CUDA? That would have been the solution. Does anybody know how to get around it?


I have almost the same question: I have a 2-D complex array and I want to perform a cuFFT on every row.
I know that I can use a for loop in my CPU code, as shown below:

for (r = 0; r < Nr_; r++) {
    // fill a temporary row buffer to pass to cuFFT
    for (t = 0; t < Nt_; t++) {
        h_dataout[t].x = real(u_(r,t));
        h_dataout[t].y = imag(u_(r,t));
    }
    cudaMemcpy(d_dataout, h_dataout, sizeof(cufftComplex) * Nt_, cudaMemcpyHostToDevice);
    cufftExecC2C(p, d_dataout, d_datain, CUFFT_INVERSE); // execute the FFT on this row
    cudaMemcpy(h_datain, d_datain, sizeof(cufftComplex) * Nt_, cudaMemcpyDeviceToHost);
}

But I am wondering whether I can send the whole array to a kernel and have each thread perform a cuFFT on one row. Is there any way to control cuFFT from a kernel?

Can anyone help me with this, please?


I would like to know about that, too, as mentioned in my previous post. It would be preferable to have a cuFFT function callable from a kernel.

Perhaps calling Vasily’s FFT code from the kernel would work, at least for fixed size arrays.


Yes, I think it would be really useful to call cuFFT from a kernel.

You mentioned Vasily’s FFT code. What is this code and where can I find it?

Thanks :)

The code is contained in the post entitled “my speedy FFT, 3x faster than CUFFT” by vvolkov.

I think I’ll take the FFT code inside Vasily’s kernels and make functions usable by my kernels. I’ll have to figure out what the load<>, twiddle<>, transpose<> and store<> calls are before that though.


Did you manage to understand how to use the speedy FFT? I just used a loop on the CPU to call cuFFT and it doesn't scale well at all.

To answer the first question: not entirely. I have replaced the cuFFT calls with calls to Volkov’s FFTxxx and performance improved significantly. My code, which is a sequence of 3 x (kernel, FFT), executed in 15.8 ms using cuFFT and in 8.9 ms using Volkov’s FFT. (The job of the kernels is to shuffle data around in order to create arrays of input vectors for the FFT batch.) The 3 FFTs (60x8x8 FFT512 + 512x60x8 FFT8 + 512x60x8 FFT8) by themselves take 1.4 ms now, compared to 8 ms using cuFFT.

My understanding is that FFTxxx( float2 *work, int batch ) distributes the FFTs of a batch of vectors of length xxx onto its own threads and blocks, using an FFTxxx_device( float2 *work ) kernel. In this kernel, the work pointer gets recalculated and then some magic happens which is beyond my comprehension. For example, I do not quite see how the FFT of a given vector from the batch is distributed across the threads, or which threads work on that vector.

The reason I need further improvement is to get the total processing time below 4 ms. In essence I have a three-dimensional data structure of size KxMxN, say a volume x-y-z. I need to perform an FFT 1) on all MxN vectors of length K (say in the x direction), followed by 2) on all KxM vectors of length N (say in the z direction), followed by 3) on all KxN vectors of length M (say in the y direction). In order to do that I need to rearrange the data, and my data-shuffling kernels take 6.1 ms, which is why I need to either:

A) include the FFTxxx in my kernel,

that is, take the contents of the kernel for a particular FFTxxx and paste them into my kernel, whose number of threads and blocks is determined by my data structures. But I have the impression that this does not work.

B) include the data shuffle in the FFTxxx kernels.

Here I would have to understand how a vector is composed in the FFTxxx kernel, or how to reassemble a given contiguous vector from a global data structure.

Conclusion: I need to understand Volkov’s code, or as the Austrian writer Egon Friedell put it: if you steal a race horse and you want to ride it, your riding skills have to match those of the person who trained it.
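
For what option B involves in terms of addressing, here is a small CPU-side sketch in plain C. It assumes (this is not stated above) that the volume is stored row-major, with element (k, m, n) at index (k*M + m)*N + n. The names cpx, axis_layout, axis_info, gather and scatter are invented for the illustration and are not part of Volkov's code or cuFFT.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y; } cpx;   /* hypothetical stand-in for float2 */

/* Length of a vector along one axis and the distance between its elements. */
typedef struct { size_t len, stride; } axis_layout;

/* Assumed layout: element (k, m, n) lives at index (k*M + m)*N + n.
   axis 0 -> vectors of length K, axis 1 -> length M, axis 2 -> length N. */
axis_layout axis_info(int axis, size_t K, size_t M, size_t N)
{
    axis_layout a;
    assert(axis >= 0 && axis <= 2);
    if (axis == 0)      { a.len = K; a.stride = M * N; }
    else if (axis == 1) { a.len = M; a.stride = N; }
    else                { a.len = N; a.stride = 1; }
    return a;
}

/* Copy one strided vector, whose first element is vol[first], into dst. */
void gather(const cpx *vol, size_t first, axis_layout a, cpx *dst)
{
    for (size_t i = 0; i < a.len; i++)
        dst[i] = vol[first + i * a.stride];
}

/* Write a contiguous vector back to its strided position in the volume. */
void scatter(cpx *vol, size_t first, axis_layout a, const cpx *src)
{
    for (size_t i = 0; i < a.len; i++)
        vol[first + i * a.stride] = src[i];
}
```

Under this layout only the vectors along the last axis are contiguous; the other two are strided by N and by M*N, which is what the shuffle kernels have to undo. Later cuFFT releases accept stride and distance parameters through cufftPlanMany, which may make a separate shuffle unnecessary, depending on the cuFFT version in use.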



Thanks Peter, I found a way to do the FFT in a loop. It's basically really simple.

When creating the plan:

cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch)

the fourth argument, batch, sets the number of transforms, i.e. the loop count.

Hi mandana,

Can you please provide a simple version of what your loop looks like? That would be really nice. I tried the same method, but somehow my result looks really strange and I cannot figure out whether I “hit” the right rows within my loop.

Regards, rob

Hi Rob,

Here is my code:

cufftResult result = cufftPlan1d(&p, Nt_, CUFFT_C2C, Nr_); // Nr_ transforms, each of length Nt_

if (result == CUFFT_SUCCESS)
    printf("plan created successfully\n");

cudaMemcpy(d_datain, h_datain, sizeof(cufftComplex) * Nt_ * Nr_, cudaMemcpyHostToDevice);

cufftExecC2C(p, d_datain, d_dataout, CUFFT_FORWARD); // execute the FFT on all Nr_ rows in one call

cudaMemcpy(h_dataout, d_dataout, sizeof(cufftComplex) * Nt_ * Nr_, cudaMemcpyDeviceToHost);

Basically, my input here is one very large 1D array, and the code performs a cuFFT on each consecutive chunk of size Nt_ in this array. The last argument of the plan function is the batch, which sets the number of chunks, i.e. the loop count.