khosra
April 22, 2010, 6:02am
1
I am doing a 3D convolution and am seeing a dramatic speed difference between the R2C, C2R path and the C2C, C2C path.
x, y are complex (float32, float32) of dimension (64, 64, 512)
C2C: real( ifft3( fft3(x) * fft3(y) ) )
R2C, C2R: irfft3( rfft3( real(x) ) * rfft3( real(y) ) )
I get correct results in both cases, but the second case (R2C, C2R) is 800x slower.
The intermediate R2C results have dimension (64, 64, 257), as instructed in the cuFFT library guide.
What am I doing wrong?
(CUDA 3.0 Mac OS 10.6.3 GeForce 9600M GT)
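A small sketch of why the two pipelines above should agree, assuming the imaginary parts of x and y are zero (otherwise the C2C result also picks up a contribution from the imaginary parts). It is 1D, pure Python, and uses a naive O(N^2) DFT as a stand-in for cuFFT; the helper names `dft`, `rdft`, `irdft`, `conv_c2c`, and `conv_r2c` are made up for illustration and are not cuFFT calls. It also shows where a half-spectrum length of N/2 + 1 (257 for N = 512) comes from.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT; a stand-in for a real FFT library."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
           for j in range(n)]
    return [v / n for v in out] if inverse else out

def conv_c2c(x, y):
    """Complex path: real(ifft(fft(x) * fft(y)))."""
    X, Y = dft(x), dft(y)
    return [v.real for v in dft([a * b for a, b in zip(X, Y)], inverse=True)]

def rdft(x):
    """Half spectrum of a real signal: only bins 0..N/2 are kept."""
    return dft(x)[: len(x) // 2 + 1]

def irdft(half, n):
    """Rebuild the full spectrum by Hermitian symmetry, then invert (even n)."""
    full = list(half) + [half[n - j].conjugate() for j in range(n // 2 + 1, n)]
    return [v.real for v in dft(full, inverse=True)]

def conv_r2c(x, y):
    """Real path: irfft(rfft(x) * rfft(y))."""
    prod = [a * b for a, b in zip(rdft(x), rdft(y))]
    return irdft(prod, len(x))

# Real-valued test signals (imaginary parts assumed to be zero).
x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 2.0]
y = [0.5, 0.0, 1.0, 0.0, 0.0, -2.0, 1.0, 0.0]

half_len = len(rdft(x))  # N/2 + 1 bins for N = 8
max_diff = max(abs(a - b) for a, b in zip(conv_c2c(x, y), conv_r2c(x, y)))
print(half_len, max_diff)
```

Both paths compute the same circular convolution up to rounding, so any large timing gap would have to come from the implementation and not from the math.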
eelsen
April 22, 2010, 6:03pm
2
I don’t know about the factor of 800, but the R2C and C2R transforms are slower than the C2C transforms (by a factor of 2-3, I think). At least they were in CUDA 2.3; I haven’t checked yet in 3.0.
khosra
April 25, 2010, 1:25am
3
Quoting CUFFT Library docs:
For 1D transforms, the performance for real data will either match or be less than the complex equivalent (due to an extra copy in some cases). However, there is usually a performance benefit to using real data for 2D and 3D FFTs, since all transforms but the last dimension operate on roughly half the logical signal size.
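To put a number on "roughly half" for the (64, 64, 512) case in this thread, here is the arithmetic as a short sketch. It assumes single-precision complex values of 8 bytes each (two float32) and that the last dimension is the one that gets halved; the variable names are only illustrative.

```python
# Dimensions from the thread; the last dimension is halved by R2C.
nx, ny, nz = 64, 64, 512
bytes_per_complex64 = 8  # two float32 values (assumption stated above)

full_elems = nx * ny * nz             # C2C spectrum: 64 x 64 x 512
half_elems = nx * ny * (nz // 2 + 1)  # R2C spectrum: 64 x 64 x 257

full_bytes = full_elems * bytes_per_complex64
half_bytes = half_elems * bytes_per_complex64
ratio = half_elems / full_elems

print(full_elems, half_elems, ratio)
```

The R2C intermediate holds 257/512 of the complex elements, which is about 50.2% of the C2C size.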
I would be very happy with equal performance. My 3D matrices are quite big. I feel like I fall into the optimal case for R2C. 800x slower indicates I am doing something wrong.
Not sure what else to try. My matrix dimensions are powers of 2. Any tips? Would profiling the code help (I have never tried profiling)?
My guess from profiling the code is that R2C and C2R aren’t implemented. Data is copied to an auxiliary buffer and C2C is run on that buffer instead.
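The fallback guessed at above can be sketched like this. This is only an illustration of the idea (copy real data into a complex buffer, run a full complex transform, keep the non-redundant half) and is not confirmed to be what cuFFT does; it is 1D, pure Python, and `dft` and `r2c_via_c2c` are hypothetical helper names.

```python
import cmath

def dft(x):
    """Naive O(N^2) forward DFT; a stand-in for a C2C transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def r2c_via_c2c(real_signal):
    """Emulate R2C with an auxiliary complex buffer and a C2C transform."""
    # Step 1: copy the real input into an auxiliary complex buffer (imag = 0).
    aux = [complex(v, 0.0) for v in real_signal]
    # Step 2: run the full complex transform on that buffer.
    full = dft(aux)
    # Step 3: keep only the non-redundant half, bins 0..N/2.
    return full[: len(real_signal) // 2 + 1]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 2.0]
n = len(signal)
half = r2c_via_c2c(signal)
full = dft(signal)

# The discarded bins are complex conjugates of the kept ones.
redundant = all(abs(full[n - j] - half[j].conjugate()) < 1e-9
                for j in range(1, n // 2))
print(len(half), redundant)
```

If R2C really works this way, it does the full C2C work plus extra copies, so it would not be faster than C2C. That alone would not obviously account for a factor of 800.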