I’m using CUDA to compute 3D FFTs of various sizes and I’m noticing some strange effects with regard to the chosen dimensions. I had expected
power-of-two dimensions like 256x256x256 to be fastest, but apparently 216 cubed (216 = 6^3), 243 cubed (243 = 3^5), and 343 cubed (343 = 7^3) are a lot faster. Even
though 343^3 (around 40M voxels) is much bigger than 256^3 (around 16M voxels), the former is still much faster. Dimensions made up of small prime
factors are documented to perform well in the CUDA FFT pdf, but I can’t understand why 256^3 performs relatively badly. Does anybody have an explanation
for this effect, or am I doing something wrong? The results:
dimensions   CUDA time (ms)   FFTW time (ms)
256^3        369.190277       ~  830
216^3         78.293449       ~  570
243^3        104.674583       ~ 1240
343^3        219.379028       ~ 3127
I checked the CUDA profiler log and compared some of the calls for 256^3 to 343^3. Part of the 256^3 calls look like this:
method=[ _Z13c2c_radix4_spifiPvS_i11tfStride_st ] gputime=[ 40.320 ] cputime=[ 106.150 ] occupancy=[ 0.417 ] gst_incoherent=[ 0 ] gst_coherent=[ 1792 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 34.816 ] cputime=[ 55.012 ] occupancy=[ 0.333 ] gst_incoherent=[ 7800 ] gst_coherent=[ 160 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 30.304 ] cputime=[ 50.332 ] occupancy=[ 0.333 ] gst_incoherent=[ 0 ] gst_coherent=[ 2576 ]
method=[ _Z13c2c_radix4_spifiPvS_i11tfStride_st ] gputime=[ 40.128 ] cputime=[ 60.324 ] occupancy=[ 0.417 ] gst_incoherent=[ 0 ] gst_coherent=[ 2048 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ] gputime=[ 32.928 ] cputime=[ 52.936 ] occupancy=[ 0.333 ] gst_incoherent=[ 7740 ] gst_coherent=[ 144 ]
method=[ _Z13c2c_transposeiiiiiiPvS_ ]
There are about 5160 radix-4 calls and twice as many transpose calls.
For 343^3 it looks like this:
method=[ _Z13c2c_radix7_spifPvS_i11tfStride_st ] gputime=[ 109.088 ] cputime=[ 128.993 ] occupancy=[ 0.250 ] gst_incoherent=[ 14378 ] gst_coherent=[ 112 ]
method=[ _Z13c2c_radix7_spifPvS_i11tfStride_st ] gputime=[ 106.624 ] cputime=[ 126.864 ] occupancy=[ 0.250 ] gst_incoherent=[ 15076 ] gst_coherent=[ 64 ]
Here there are about 3510 radix-7 calls. While I’m not really familiar with the inner workings of FFT algorithms, apparently some ‘transpose’ step must be performed for 256^3 but
not for 343^3. Even assuming the algorithm for power-of-two FFTs differs from the one for other dimensions, I wouldn’t expect it to be slower.
These results are very different from 2D FFTs, where power-of-two dimensions perform best by far.
Aside from that, I noticed that planning times are almost negligible (about 0.01 to 0.10 milliseconds), while I read somewhere that planning was a big
bottleneck, especially for smaller FFTs (50-80 ms). Could this be due to the newest CUDA version and improvements to CUFFT?
All times were measured using the built-in timer functions from CUDA.
Thanks in advance for any insight,