cuFFT R2C, C2R 800x slower than C2C in a convolution

I am doing a 3D convolution and am seeing a dramatic difference in speed between the R2C/C2R path and the C2C/C2C path.

x and y are complex arrays (float32 real, float32 imaginary) of dimensions (64, 64, 512).

  1. C2C: real( ifft3( fft3(x) * fft3(y) ) )

  2. R2C, C2R: irfft3( rfft3( real(x) ) * rfft3( real(y) ) )

I get the correct results in both cases but case 2 is 800x slower.

The intermediate R2C results have dimensions (64, 64, 257), as the cuFFT library guide specifies.
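For reference, here is a small sketch of where the 257 comes from. It assumes the usual real-to-complex convention in which only the last (fastest-varying) dimension is cut to n/2 + 1 complex values. The helper names `r2c_output_shape` and `num_elements` are made up for illustration and are not part of cuFFT.

```python
def r2c_output_shape(shape):
    """Shape of the half-spectrum produced by a real-to-complex FFT.

    Assumption: only the last (fastest-varying) dimension is
    shortened, to n // 2 + 1 complex values.
    """
    *leading, last = shape
    return (*leading, last // 2 + 1)


def num_elements(shape):
    """Total number of elements in an array of the given shape."""
    count = 1
    for n in shape:
        count *= n
    return count


full = (64, 64, 512)
half = r2c_output_shape(full)
print(half)                                      # (64, 64, 257)
print(num_elements(half) / num_elements(full))   # 0.501953125
```

So the R2C path should touch roughly half as many complex values as the C2C path, which is why I expected it to be at least as fast.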

What am I doing wrong?

(CUDA 3.0 Mac OS 10.6.3 GeForce 9600M GT)

I don’t know about the factor of 800, but the R2C and C2R transforms are slower than the C2C transforms (by a factor of 2-3, I think). At least they were in 2.3; I haven’t checked yet in 3.0.

Quoting CUFFT Library docs:

I would be very happy with equal performance. My 3D matrices are quite big, and I feel like I fall into the optimal case for R2C. A slowdown of 800x suggests I am doing something wrong.

Not sure what else to try. My matrix dimensions are powers of 2. Any tips? Would profiling the code help (I have never tried profiling)?

My guess from profiling the code is that R2C and C2R aren’t implemented natively. The data is copied to an auxiliary buffer and C2C is run on that buffer instead.
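To make that guess concrete, here is a rough 1D sketch of what "copy to an auxiliary buffer and run C2C" would look like. This is not cuFFT code and not a claim about how cuFFT is actually written. It uses a naive O(n^2) DFT in plain Python, and all function names (`dft`, `r2c_via_c2c`, `c2r_via_c2c`, `circular_convolve_real`) are hypothetical.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) complex DFT; the inverse includes the 1/n scaling."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for m, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * m * k / n)
        out.append(acc / n if inverse else acc)
    return out


def r2c_via_c2c(real_values):
    """Emulated R2C: copy into a complex buffer, run C2C, keep n // 2 + 1 bins."""
    buffer = [complex(v, 0.0) for v in real_values]  # auxiliary complex copy
    return dft(buffer)[: len(real_values) // 2 + 1]


def c2r_via_c2c(half_spectrum, n):
    """Emulated C2R: rebuild the full spectrum by Hermitian symmetry,
    run an inverse C2C, and keep the real parts."""
    full = list(half_spectrum)
    for k in range(len(half_spectrum), n):
        full.append(half_spectrum[n - k].conjugate())
    return [v.real for v in dft(full, inverse=True)]


def circular_convolve_real(x, y):
    """Circular convolution of two real sequences via the emulated R2C/C2R path."""
    n = len(x)
    product = [a * b for a, b in zip(r2c_via_c2c(x), r2c_via_c2c(y))]
    return c2r_via_c2c(product, n)
```

If the library does something like this, every real transform pays for extra copies and a full-size C2C on top, which would explain R2C/C2R being slower than C2C, though probably not a factor of 800 by itself.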