3D CUFFT issues / new implementation?

Hi,

I have seen some hints on this forum that we can expect a new version of CUFFT that should improve performance and fix some problems, especially regarding 3D FFTs. 3D FFT performance is poor for small dimensions (for example 32x32x32), even without taking memory copies into account; it is sometimes as much as ten times slower than FFTW. I was wondering if anybody has similar experiences and possibly even some solutions for this.
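For reference, timings like the above can be taken with CUDA events so that host-device copies are excluded entirely; a minimal sketch along those lines (the 32x32x32 size and iteration count are just illustrative):

```c
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>

int main(void)
{
    const int N = 32;          /* illustrative: one 32x32x32 transform */
    const int ITER = 100;      /* average over many runs */

    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * N * N * N);

    cufftHandle plan;
    cufftPlan3d(&plan, N, N, N, CUFFT_C2C);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Time only the device-side transforms; no host<->device copies. */
    cudaEventRecord(start, 0);
    for (int i = 0; i < ITER; ++i)
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average 32^3 C2C transform: %.3f ms\n", ms / ITER);

    cufftDestroy(plan);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```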

Any estimate on the release date of the new CUFFT implementation?

Regards,

Kevin

Like every other CUDA function, you cannot achieve higher performance than the CPU version (which benefits from a much smarter cache strategy and has no call overhead) for small input sizes. So I don’t think this problem will be resolved soon.

The fact that the GPU doesn’t perform well for small workloads cannot be hidden. But is it possible to batch up these small transforms in your particular application?
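Something along these lines, assuming a CUFFT release that exposes cufftPlanMany (the transform size, batch count, and tightly packed layout here are illustrative, and error checking is omitted):

```c
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    int n[3] = { 32, 32, 32 };               /* size of each 3D transform */
    const int batch = 1000;                  /* illustrative batch count */
    const size_t per = (size_t)n[0] * n[1] * n[2];

    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * per * batch);

    /* One plan for the whole batch; NULL embed arrays mean the signals
       are tightly packed one after another in device memory. */
    cufftHandle plan;
    cufftPlanMany(&plan, 3, n,
                  NULL, 1, (int)per,         /* input layout  */
                  NULL, 1, (int)per,         /* output layout */
                  CUFFT_C2C, batch);

    /* All transforms in a single call, in place. */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

Launching the whole batch through one plan and one exec call amortizes the per-call overhead that dominates at small sizes.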

I am now using a mix of CUFFT for larger FFTs and FFTW for small FFTs in my program. Unfortunately there is some overhead for copying data between host and device. I didn’t know there was a way to batch 3D FFTs; I tried to find information about it but couldn’t. I’m not too sure my program would really benefit from it, though…

Regards,

Kevin

Assuming the existence of batched 3D transforms in CUFFT, what batch size would you be able to supply from your application’s design?

My application uses multiple 3D FFTs of size 24 (24x24x24). Obviously, the CUDA FFT doesn’t perform well there because of transfer overheads. There is a good spike in performance at size 49, but I don’t need that size.

Attached is a plot of performance for 3D FFT.

Any help on increasing the performance for size 24 would really be appreciated.

Thanks
[Attachment: fc.JPG — 3D FFT performance plot]

That’s a difficult one; it largely depends on the amount of memory I still have available on the graphics card. In total I will, for instance, need to do about 40k transforms of 32x32x32 or 64x64x64, so depending on the amount of memory left I’d say about 1000 at a time. That would take around 32x32x32 x 4 bytes x 1000 ≈ 131 MB of memory.
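As a sketch of that sizing logic (the 4 bytes per element matches my estimate above, though complex cufftComplex data would double it, and the headroom fraction for CUFFT’s internal work buffers is only a guess):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);         /* memory left on the device */

    /* 4 bytes per element as in the estimate above; cufftComplex
       (interleaved float2) would need 8 bytes per element instead. */
    const size_t perTransform = (size_t)32 * 32 * 32 * 4;

    /* Leave headroom for CUFFT's internal work buffers; the 50%
       figure here is a guess, not a documented requirement. */
    size_t batch = (freeB / 2) / perTransform;
    if (batch > 1000)
        batch = 1000;                        /* cap at ~1000 at a time */

    printf("free: %lu MB -> batch of %lu transforms\n",
           (unsigned long)(freeB >> 20), (unsigned long)batch);
    return 0;
}
```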