I’m using CUDA to compute 3D FFTs of various sizes and I’m noticing some strange effects with regard to the chosen dimensions. I had expected
power-of-two dimensions like 256x256x256 to be fastest, but apparently 216 cubed (216 = 6^3), 243 cubed (243 = 3^5), and 343 cubed (343 = 7^3) are a lot faster. Even
though 343^3 (around 40M voxels) is much bigger than 256^3 (around 16M voxels), the former is still much faster. Dimensions made up of small prime
factors are documented to perform well in the CUDA FFT pdf, but I can’t understand why 256^3 performs relatively badly. Does anybody have an explanation
for this effect, or am I doing something wrong? The results:
dimensions   CUDA time (ms)   FFTW time (ms)
256^3        369.190277       ~  830
216^3         78.293449       ~  570
243^3        104.674583       ~ 1240
343^3        219.379028       ~ 3127
I checked the CUDA profiler log and compared some of the calls for 256^3 to 343^3. Part of the 256^3 calls look like this:
method=[ _Z13c2c_radix4_spifiPvS_i11tfStride_st ] gputime=[ 40.320 ] cputime=[ 106.150 ] occupancy=[ 0.417 ] gst_incoherent=[ 0 ] gst_coherent=[ 1792 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 34.816 ] cputime=[ 55.012 ] occupancy=[ 0.333 ] gst_incoherent=[ 7800 ] gst_coherent=[ 160 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 30.304 ] cputime=[ 50.332 ] occupancy=[ 0.333 ] gst_incoherent=[ 0 ] gst_coherent=[ 2576 ]
method=[ _Z13c2c_radix4_spifiPvS_i11tfStride_st ] gputime=[ 40.128 ] cputime=[ 60.324 ] occupancy=[ 0.417 ] gst_incoherent=[ 0 ] gst_coherent=[ 2048 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 32.928 ] cputime=[ 52.936 ] occupancy=[ 0.333 ] gst_incoherent=[ 7740 ] gst_coherent=[ 144 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ]
There are about 5160 radix-4 calls and twice as many transpose calls.
For 343^3 it looks like this:
method=[ _Z13c2c_radix7_spifPvS_i11tfStride_st ] gputime=[ 109.088 ] cputime=[ 128.993 ] occupancy=[ 0.250 ] gst_incoherent=[ 14378 ] gst_coherent=[ 112 ]
method=[ _Z13c2c_radix7_spifPvS_i11tfStride_st ] gputime=[ 106.624 ] cputime=[ 126.864 ] occupancy=[ 0.250 ] gst_incoherent=[ 15076 ] gst_coherent=[ 64 ]
Here there are about 3510 radix-7 calls. While I’m not really familiar with the inner workings of FFT algorithms, apparently some ‘transpose’ step must be performed for 256^3 but
not for 343^3. Even assuming the algorithm for power-of-two FFTs differs from the one for other dimensions, I wouldn’t expect it to be slower.
These results are very different from 2D FFTs, where power-of-two dimensions perform best by far.
Aside from that, I noticed that planning times are almost negligible (about 0.01 to 0.10 milliseconds), while I read somewhere that planning was a big
bottleneck, especially for smaller FFTs (50-80 ms). Could this be due to the newest CUDA version and improvements to CUFFT?
All times were measured using the built-in timer functions from CUDA.
Thanks in advance for any insight,