3D FFT efficiency


It’s been quite a while since my last (big) project using CUDA. (We were using CUDA toolkit/SDK version 2.3 at that time, two years ago.)
But I remember quite well reading statements from several people saying that the 3D implementation of the FFT in the CUDA libraries is rather inefficient compared to the 2D version.
That is, compared to the quite well optimized FFT implementations in Matlab, the CUDA implementation for 2D shows significant/“breathtaking” speedups, whereas the 3D version doesn’t, or rather didn’t.

Is this still true today?
Or can one say that the FFT implementations in CUDA, no matter whether 2D or 3D, are quite well optimized and show significant speedups compared to CPU/Matlab implementations?

To be precise, I’m planning to implement a fast MRI reconstruction algorithm (gridding, 3D FFT, …) on a CUDA GPU.
But before I start asking annoying questions about this, I’m going to read through the “MRI on CUDA” section right here!

Thanks for your help in advance!

So far I’ve found the NukadaFFT library: (homepage or thread in this forum)

As mentioned in their paper, the 3D CUFFT shows low performance, especially for non-power-of-two transform sizes.
But their timing results look pretty promising!
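
Since the transform size seems to matter so much, here is a small sketch (plain Python, nothing CUDA-specific) for checking whether a 3D grid has power-of-two dimensions and how much extra volume rounding each dimension up would cost. The function names and the 192 x 192 x 160 grid are just made up for illustration, and I'm assuming the grid size can be chosen freely (e.g. via the gridding oversampling), which may not hold for every reconstruction.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def padded_grid(shape):
    """Round every dimension of a grid up to the next power of two."""
    return tuple(next_power_of_two(d) for d in shape)


def padding_overhead(shape):
    """Ratio of the padded grid volume to the original grid volume."""
    original_volume = 1
    padded_volume = 1
    for d, p in zip(shape, padded_grid(shape)):
        original_volume *= d
        padded_volume *= p
    return padded_volume / original_volume


if __name__ == "__main__":
    grid = (192, 192, 160)  # hypothetical grid size, not from the paper
    print(all(is_power_of_two(d) for d in grid))
    print(padded_grid(grid))
    print(padding_overhead(grid))
```

For this made-up grid every dimension rounds up to 256, so the padded volume is almost three times the original. Whether that still pays off compared to running the transform at the original size would have to be measured.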

So I guess I will go in this direction (using the NukadaFFT library) for my project.
Does anyone have good arguments against using this library?

thanks, fab