3D CUFFT strange effect on volume dimensions

Hello,

I’m using CUDA to compute 3D FFTs of various sizes and I’m noticing some strange effects depending on the chosen dimensions. I had expected power-of-two
dimensions like 256x256x256 to be fastest, but apparently 216 (= 6^3) cubic, 243 (= 3^5) cubic and 343 (= 7^3) cubic are a lot faster. Even though 343^3
(around 40M voxels) is much bigger than 256^3 (around 16M voxels), the former is still much faster. Dimensions made up of small prime factors apparently
perform well, as documented in the CUDA FFT PDF, but I can’t understand why 256^3 performs relatively badly. I wonder if anybody has an explanation for
this effect, or whether I’m doing something wrong. The results:

dimensions    CUDA time (ms)    FFTW time (ms)
256^3         369.190277        ~830
216^3          78.293449        ~570
243^3         104.674583        ~1240
343^3         219.379028        ~3127

I checked the CUDA profiler log and compared some of the calls for 256^3 to 343^3. Part of the 256^3 calls look like this:

method=[ _Z13c2c_radix4_spifiPvS_i11tfStride_st ] gputime=[ 40.320 ] cputime=[ 106.150 ] occupancy=[ 0.417 ] gst_incoherent=[ 0 ] gst_coherent=[ 1792 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 34.816 ] cputime=[ 55.012 ] occupancy=[ 0.333 ] gst_incoherent=[ 7800 ] gst_coherent=[ 160 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 30.304 ] cputime=[ 50.332 ] occupancy=[ 0.333 ] gst_incoherent=[ 0 ] gst_coherent=[ 2576 ]
method=[ _Z13c2c_radix4_spifiPvS_i11tfStride_st ] gputime=[ 40.128 ] cputime=[ 60.324 ] occupancy=[ 0.417 ] gst_incoherent=[ 0 ] gst_coherent=[ 2048 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 32.928 ] cputime=[ 52.936 ] occupancy=[ 0.333 ] gst_incoherent=[ 7740 ] gst_coherent=[ 144 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ]

There are about 5160 radix4 calls and twice as many transpose calls.

For 343^3 it looks like this:

method=[ _Z13c2c_radix7_spifPvS_i11tfStride_st ] gputime=[ 109.088 ] cputime=[ 128.993 ] occupancy=[ 0.250 ] gst_incoherent=[ 14378 ] gst_coherent=[ 112 ]
method=[ _Z13c2c_radix7_spifPvS_i11tfStride_st ] gputime=[ 106.624 ] cputime=[ 126.864 ] occupancy=[ 0.250 ] gst_incoherent=[ 15076 ] gst_coherent=[ 64 ]

Here there are about 3510 radix7 calls. While I’m not really up on the inner workings of FFT algorithms, apparently some ‘transpose’ must be performed for
256^3 but not for 343^3. Assuming the algorithm for power-of-two FFTs is different from the one for other dimensions, I still wouldn’t expect it to be slower.

These results are very different from 2D FFTs, where power-of-two dimensions perform best by far.


Aside from that, I noticed that planning times are almost negligible (about 0.01 to 0.10 milliseconds), while I read somewhere that planning times are a big
bottleneck, especially for smaller FFTs (50-80 ms). Could this be due to the newest CUDA version and improvements to CUFFT?

All times were measured using the builtin timer functions from CUDA.
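A stripped-down version of such a benchmark looks roughly like the following (a simplified sketch, not my exact code: it uses CUDA event timers rather than the built-in timer functions, error checking is omitted, and it of course needs a CUDA-capable GPU and links against CUFFT):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int n = 216;                          /* cube edge length */
    const size_t count = (size_t)n * n * n;

    cufftComplex *data;
    cudaMalloc((void **)&data, count * sizeof(cufftComplex));
    cudaMemset(data, 0, count * sizeof(cufftComplex));

    /* Plan creation can be timed the same way to check the
     * planning-overhead question. */
    cufftHandle plan;
    cufftPlan3d(&plan, n, n, n, CUFFT_C2C);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  /* in-place forward FFT */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d^3 C2C forward FFT: %.3f ms\n", n, ms);

    cufftDestroy(plan);
    cudaFree(data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```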

Thanks in advance for any insight,

Kevin

Hello, Kevin. You are right that power-of-two 1D transforms are the fastest, and this looks like an issue with the existing CUFFT code. As for planning, CUFFT doesn’t do the exhaustive, time-consuming tuning that FFTW does, only internal memory allocation plus host-side “infrastructure” initialization. We are currently working on a new FFT implementation with increased performance and a decreased memory footprint, as well as certain API extensions.