cuFFT R2C, C2R 800x slower than C2C in a convolution

khosra · April 22, 2010, 6:02am

I am doing a 3D convolution and am observing dramatic differences in speed for R2C, C2R vs C2C, C2C.

x, y are complex (float32, float32) of dimension (64, 64, 512)

C2C: real( ifft3( fft3(x) * fft3(y) ) )
R2C, C2R: irfft3( rfft3( real(x) ) * rfft3( real(y) ) )

I get the correct results in both cases but case 2 is 800x slower.
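For reference, here is a minimal 1D sketch of why the two pipelines should agree. It assumes the inputs are purely real (zero imaginary part); if x or y carried a nonzero imaginary part, the C2C expression above and the `real(x)` version would not compute the same thing. The naive `dft` helper and the function names are hypothetical stand-ins for illustration, not cuFFT calls.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT; inverse=True applies the 1/N-scaled inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def conv_c2c(x, y):
    """real(ifft(fft(x) * fft(y))) using full-length complex spectra."""
    X, Y = dft(x), dft(y)
    prod = [a * b for a, b in zip(X, Y)]
    return [v.real for v in dft(prod, inverse=True)]

def conv_r2c_c2r(x, y):
    """Same circular convolution, keeping only the n//2 + 1 non-redundant bins."""
    n = len(x)
    half = n // 2 + 1
    X, Y = dft(x)[:half], dft(y)[:half]      # stand-in for rfft
    P = [a * b for a, b in zip(X, Y)]
    # Stand-in for irfft: rebuild the dropped bins from conjugate symmetry.
    full = P + [P[n - k].conjugate() for k in range(half, n)]
    return [v.real for v in dft(full, inverse=True)]

# Small real-valued test signals (assumed data, chosen for easy checking).
x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 2.0]
y = [0.5, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
a = conv_c2c(x, y)
b = conv_r2c_c2r(x, y)
max_diff = max(abs(p - q) for p, q in zip(a, b))
print(max_diff)
```

The half-spectrum path touches roughly half as many frequency bins, which is why equal or better performance than C2C is a reasonable expectation.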

The intermediate R2C results have dimensions (64, 64, 257), as specified in the cuFFT library guide.
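The (64, 64, 257) figure follows from halving the last dimension and adding one. A quick sketch of that arithmetic, with hypothetical helper and variable names, assuming single-precision complex elements of 8 bytes each:

```python
def r2c_shape(nx, ny, nz):
    """Shape of the half-spectrum from a 3D real-to-complex transform,
    assuming the last dimension is the one that gets halved."""
    return (nx, ny, nz // 2 + 1)

BYTES_PER_COMPLEX64 = 8  # two float32 values per element (assumption)

dims = (64, 64, 512)
half = r2c_shape(*dims)

full_elems = dims[0] * dims[1] * dims[2]
half_elems = half[0] * half[1] * half[2]

print("R2C output shape:", half)
print("C2C buffer bytes:", full_elems * BYTES_PER_COMPLEX64)
print("R2C buffer bytes:", half_elems * BYTES_PER_COMPLEX64)
```

So each intermediate R2C buffer is a little over half the size of the corresponding C2C buffer.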

What am I doing wrong?

(CUDA 3.0 Mac OS 10.6.3 GeForce 9600M GT)

eelsen · April 22, 2010, 6:03pm

I don’t know about the factor of 800, but the R2C and C2R transforms are slower than the C2C transforms (by a factor of 2-3, I think). At least they were in CUDA 2.3; I haven’t checked 3.0 yet.

khosra · April 25, 2010, 1:25am

Quoting CUFFT Library docs:

I would be very happy with equal performance. My 3D matrices are quite big. I feel like I fall into the optimal case for R2C. 800x slower indicates I am doing something wrong.

Not sure what else to try. My matrix dimensions are powers of 2. Any tips? Would profiling the code help (I have never tried profiling)?

laughingrice · April 27, 2010, 11:38pm

My guess from profiling the code is that R2C and C2R aren’t implemented natively. The data is copied to an auxiliary buffer and C2C is run on that buffer instead.
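If that guess is right, the R2C path would behave roughly like the 1D sketch below. This only illustrates the suspected behavior, not cuFFT's actual code; `c2c_forward` and `emulated_r2c` are hypothetical names, and the naive DFT stands in for the real C2C kernel.

```python
import cmath

def c2c_forward(buf):
    """Naive forward complex-to-complex DFT (stand-in for a C2C kernel)."""
    n = len(buf)
    return [sum(buf[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def emulated_r2c(real_in):
    """Emulate R2C: copy to a complex buffer, run C2C, keep n//2 + 1 bins."""
    aux = [complex(v, 0.0) for v in real_in]   # the extra auxiliary copy
    full = c2c_forward(aux)                    # full-size complex transform
    return full[: len(real_in) // 2 + 1]       # discard the redundant half

# A real sine at bin 2 of an 8-point signal (assumed test data).
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
spectrum = emulated_r2c(signal)
print(len(spectrum), spectrum[2])
```

A path like this does at least as much work as a plain C2C plus an extra copy, so it would not be faster than C2C.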