Batched 2D FFT implementation

Here is my implementation of batched 2D transforms, in case anyone else finds it useful. I’ve developed and tested the code on an 8800GTX under CentOS 4.4. The API is consistent with CUFFT.

There is a lot of room for improvement (especially in the transpose kernel), but it works and it’s faster than looping a bunch of small 2D FFTs.

Thanks for all the help I’ve been given so far on this forum.

Jim Hardwick

Sorry… my attachments didn’t make it into the post. Attempt #2:


The board won’t let me upload an attachment with a .cu extension, so here is the .cu file - download it and rename it to batchfft.cu.



Here is an updated version with a faster transpose (based on the SDK transpose). Input dimensions are limited to multiples of 16.

Be sure to rename batchfft.cpp to batchfft.cu - the board doesn’t allow attachments with a .cu extension for some reason.

And here is a (slightly) faster version. The transpose kernel now has no bank conflicts. Array dimensions nx and ny must be multiples of 16.

Again, rename batchfft.cpp to batchfft.cu.

Thanks for posting the code!

I haven’t tried it yet - how much faster is the batched transform versus looping over several small transforms?

I’ve got a program that currently loops over 225 256x256 2D FFTs.

I’ll give your implementation a shot and post some benchmarks.

Thanks for posting this.

You’re welcome. I hope others find it useful. I’d also like to see any improvements people make to the code.

Obviously the speed improvement depends on the size of the transform and the number of transforms. Here is some output from my unit test code:

===Running testBatchFFT

-Size--------|-Direction---|-Time (ms)---|-Time/unit---|-Loop time---|-Time/unit---|-FFTW time---|-Time/unit---|-L2 of diff--|-PASS/FAIL---|
256x64x1     |forward      |0.192000     |0.192000     |0.125000     |0.125000     |1.633000     |1.633000     |0.000000e+00 |PASS         |
256x64x1     |inverse      |0.095000     |0.095000     |0.099000     |0.099000     |0.586000     |0.586000     |0.000000e+00 |PASS         |
256x64x10    |forward      |0.439000     |0.043900     |0.757000     |0.075700     |6.217000     |0.621700     |0.000000e+00 |PASS         |
256x64x10    |inverse      |0.441000     |0.044100     |0.755000     |0.075500     |5.917000     |0.591700     |0.000000e+00 |PASS         |
256x64x20    |forward      |0.839000     |0.041950     |1.455000     |0.072750     |11.941000    |0.597050     |0.000000e+00 |PASS         |
256x64x20    |inverse      |0.825000     |0.041250     |1.459000     |0.072950     |11.871000    |0.593550     |0.000000e+00 |PASS         |
256x64x25    |forward      |1.016000     |0.040640     |1.814000     |0.072560     |14.851000    |0.594040     |0.000000e+00 |PASS         |
256x64x25    |inverse      |1.015000     |0.040600     |1.824000     |0.072960     |14.856000    |0.594240     |0.000000e+00 |PASS         |
256x64x35    |forward      |1.405000     |0.040143     |2.517000     |0.071914     |20.860001    |0.596000     |0.000000e+00 |PASS         |
256x64x35    |inverse      |1.409000     |0.040257     |2.532000     |0.072343     |20.812000    |0.594629     |0.000000e+00 |PASS         |
256x64x70    |forward      |2.727000     |0.038957     |5.004000     |0.071486     |42.101002    |0.601443     |0.000000e+00 |PASS         |
256x64x70    |inverse      |2.734000     |0.039057     |5.035000     |0.071929     |42.091999    |0.601314     |0.000000e+00 |PASS         |
256x64x280   |forward      |10.712000    |0.038257     |20.025000    |0.071518     |170.156998   |0.607704     |0.000000e+00 |PASS         |
256x64x280   |inverse      |10.695000    |0.038196     |18.743000    |0.066939     |117.583000   |0.419939     |0.000000e+00 |PASS         |
256x128x70   |forward      |5.523000     |0.078900     |7.942000     |0.113457     |66.320999    |0.947443     |0.000000e+00 |PASS         |
256x128x70   |inverse      |5.508000     |0.078686     |7.924000     |0.113200     |65.714996    |0.938786     |0.000000e+00 |PASS         |
256x128x280  |forward      |22.312000    |0.079686     |31.667999    |0.113100     |264.674988   |0.945268     |0.000000e+00 |PASS         |
256x128x280  |inverse      |22.393999    |0.079979     |31.672001    |0.113114     |264.868011   |0.945957     |0.000000e+00 |PASS         |


Time = total time for the batched implementation

Loop time = total time for looped CUFFT

FFTW time = total time for single-threaded FFTW on a Q6600

Time/unit = each time divided by the number of transforms

L2 of diff = L2 norm of the difference between the batched and looped results

Don’t pay attention to the times in the first row - I didn’t bother to pre-initialize CUDA. Another thing to note is that the data is already on the GPU; host<->GPU transfers would obviously reduce performance.

FYI, my code runs 70 256x64 transforms at a time. The batched version is almost twice as fast at that size.


Thanks for posting this code, it was very helpful. I’ve posted some data from my 8600GTS, which I suspect would be on the low-end of the performance spectrum compared to what most people will be using.

I’m getting into a project which will require multiple 4D FFTs. Two ideas I’m pondering: 1) Using your method, perform another transpose around the z-axis to “flip” the data structure, then perform another batch of 2D FFTs. Or, 2) Perform one 3D FFT, then “flip” the structure and perform a batch of 1D FFTs. Of course, all the data would remain on the device until completed.

I’m making the assumption that the 3D cuFFT performs a batch of 1D FFTs three times - but I don’t know if it has to transpose the data between batches or if it handles the data another way.

I’m a new user and I’m not sure I have a complete grasp on all of the variables yet, but I’d appreciate any ideas you or others might have.

Best Regards,

I am by no means an FFT guru, but I’ll give you my thoughts.

I would suspect implementing another transpose/1D FFT would be easiest. I understand there are some memory access patterns you can use in a 3D FFT that may be faster than another transpose on a CPU, but the transpose should allow for coalesced memory accesses on the third FFT pass on the GPU.

I’d be interested to see what you come up with (if you can share your results).

Can someone please post sample code that uses this batchfft? I’m in my first steps of learning CUDA and I could really use the help…




There’s no error checking, but this should get you started:

#include <stdlib.h>
#include <cuda_runtime.h>
#include "batchfft.h"

int testBatchFFT(int nx, int ny, int count, int direction)
{
  int size = nx * ny;
  int byte_size = sizeof(float2) * size;

  // Allocate and initialize host memory
  float2* field = (float2*)malloc(byte_size * count);
  for(int i = 0; i < size * count; ++i)
  {
    field[i].x = rand() / (float)RAND_MAX;
    field[i].y = rand() / (float)RAND_MAX;
  }

  // Allocate device memory and copy to device
  float2* d_field;
  cudaMalloc((void**)&d_field, byte_size * count);
  cudaMemcpy(d_field, field, byte_size * count, cudaMemcpyHostToDevice);

  // Run the batch transform
  batchfftHandle batchplan;
  batchfftPlan2d(&batchplan, nx, ny, CUFFT_C2C, count);
  batchfftExecute(batchplan, d_field, d_field, direction);

  // Cleanup
  batchfftDestroy(&batchplan);
  cudaFree(d_field);
  free(field);

  return 0;
}


In your PM you said you want to do batched 1D transforms. As I mentioned in my reply, batched 1D transforms are already in CUFFT.

Sorry for digging up this old thread.

I was wondering if this batch implementation could somehow be used to speed up a single 2D complex-to-complex FFT (and inverse FFT).

I need to perform a 1024x1024 C2C FFT, and found cufft to be slower than FFTW when data transfers are included (and yes, I use pinned memory).

(My platform is a Core2Quad at 3.0 GHz and a G92 8800GTS clocked at 750/1750/2100).

Is it possible to “decompose” a single 2D 1024x1024 C2C FFT in a batch of 1D FFTs?

Or, do you have any other suggestion? I know, ideally one should perform more computation on the transformed data before moving them off the GPU, but I can’t do that.



No, this won’t speed up a single FFT. It might actually make it slower than calling CUFFT directly.

1024x1024 isn’t that big as far as the GPU is concerned, so you might be better on the CPU if you can’t move any other computation to the GPU.


Are you aware of any speedup from CUFFT 1.1 to 2.0?


Quick question:
In my C program I am looping a 2D FFT 3096 times. What’s the procedure to do the equivalent operation in CUDA? If I understand correctly, Jim tried it and it was slower. So the need for a 2D batch FFT? Can someone please explain?? Also if it’s a 2D batch FFT, does that mean I need to compute 2D FFT over an array of 4096(FFT_size) x 3096 (no. of loops)??

I found a looped 2D FFT was indeed faster on the GPU than CPU, but I wrote this batched implementation to speed it up even more.

Are you running a 2D FFT on 3096 different arrays? Or are you running 3096 FFTs on one array? Batching only applies to the former case.


I have a (64x64) 4096-point 2D FFT. In C, the 2D FFT is looped 3096 times because the input is different in every loop. In my CUDA version I am only able to compute the FFT once. I am not sure whether I should loop it or do a batch FFT, and I am also unable to write the loop because I don’t know how to change the inputs to the 2D FFT call.

If you can set up and fill in all 3096 input arrays contiguously in GPU memory, you can use a batched FFT. I’m not sure what you mean by only being able to compute the FFT once and not knowing how to change the inputs to the 2D FFT call. Can you be more clear or provide some code?


ok. Let me explain a bit more. My kernel computes 3096 positions.

Each position is responsible for a 64x64 array that undergoes FFT, multiplication, inv FFT.

I would like to run the FFT calls either in a loop or in batch mode. Which is better in terms of speed?

Also, you mentioned that you did an FFT call in a loop. How did you do it? A code example?