Can CUFFT run faster on small images?

Hello, everyone. I have been running some FFT benchmarks on my GeForce 8800 Ultra card and found a problem: for a simple 2D FFT, the GPU works less efficiently on small images than on bigger ones. Here are some test results:
256x256 8-bit image, R2C: 0.14 ms
512x512 8-bit image, R2C: 0.38 ms
1024x1024 8-bit image, R2C: 1.3 ms
I have read documents such as "FFT-based 2D convolution", but they only report results and do not explain how to make the FFT faster on small images such as 256x256. Can anyone tell me how to run the FFT efficiently on small images? Thanks.

What do you mean by “slow”? Note that 256 x 256 is only 64K points, which is far too small a workload for a GPU (or even for a CPU), so driver/OS overhead and PCI Express transfer time dominate the runtime.

Do you have multiple images you can process at once? You really need to load the GPU with a bigger problem to get better efficiency.

Yes, I want to run the FFT on multiple images at once to form a bigger problem, but cufft.lib does not support batched 2D FFTs. Can anyone tell me how to do that?

Well, you could try queueing up the CUFFT calls like this:

- Load all images into GPU memory in one copy

- Run CUFFT on each image, one call after another

- Transfer all the results back to CPU memory in one copy
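
The three steps above could look roughly like the following. This is only a sketch under some assumptions: the images have already been converted from 8-bit to float, they are packed back to back in one host buffer, and the batch size of 64 and the CHECK macros are made up for illustration. It reuses a single 2D plan and calls cufftExecR2C once per image.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Illustrative error-check macros (not part of CUDA or CUFFT).
#define CUDA_CHECK(x)  do { if ((x) != cudaSuccess)   { fprintf(stderr, "CUDA error at line %d\n",  __LINE__); exit(1); } } while (0)
#define CUFFT_CHECK(x) do { if ((x) != CUFFT_SUCCESS) { fprintf(stderr, "CUFFT error at line %d\n", __LINE__); exit(1); } } while (0)

int main() {
    const int H = 256, W = 256;   // image rows and columns
    const int N = 64;             // number of images (assumed value)

    const size_t inElems  = (size_t)H * W;            // real samples per image
    const size_t outElems = (size_t)H * (W / 2 + 1);  // complex outputs per image (R2C)

    // All images packed contiguously; filled with 1.0f as placeholder data.
    std::vector<float>        hIn(N * inElems, 1.0f);
    std::vector<cufftComplex> hOut(N * outElems);

    float*        dIn  = nullptr;
    cufftComplex* dOut = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&dIn,  N * inElems  * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void**)&dOut, N * outElems * sizeof(cufftComplex)));

    // Step 1: one host-to-device copy for the whole batch.
    CUDA_CHECK(cudaMemcpy(dIn, hIn.data(), N * inElems * sizeof(float),
                          cudaMemcpyHostToDevice));

    // One plan, reused for every image (first size argument is the slowest dimension).
    cufftHandle plan;
    CUFFT_CHECK(cufftPlan2d(&plan, H, W, CUFFT_R2C));

    // Step 2: queue one transform per image; there are no copies in between.
    for (int i = 0; i < N; ++i) {
        CUFFT_CHECK(cufftExecR2C(plan, dIn + i * inElems, dOut + i * outElems));
    }

    // Step 3: one device-to-host copy for all results.
    // cudaMemcpy waits for the queued transforms to finish first.
    CUDA_CHECK(cudaMemcpy(hOut.data(), dOut, N * outElems * sizeof(cufftComplex),
                          cudaMemcpyDeviceToHost));

    printf("DC term of image 0: %f\n", hOut[0].x);

    CUFFT_CHECK(cufftDestroy(plan));
    CUDA_CHECK(cudaFree(dIn));
    CUDA_CHECK(cudaFree(dOut));
    return 0;
}
```

This still pays one launch per image, so it mainly saves on transfer and plan setup. If your CUFFT version provides cufftPlanMany, a single batched plan may remove the per-image calls as well, but that is not available in the library version discussed here.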

Another possibility would be to implement the FFT kernel yourself and have each small group of thread blocks process a different image. This has the advantage of reducing kernel launch overhead, since one launch covers many images.

I wrote a simple batched 2D FFT that uses CUFFT. You can find it here: