# cuFFT 2D Convolution

Hello,
I’m trying to perform a 2D convolution using the “FFT + point_wise_product + iFFT” aproach. Using NxN matrices the method goes well, however, with non square matrices the results are not correct. I’ve read the whole cuFFT documentation looking for any note about the behavior with this kind of matrices, tested in-place and out-place FFT, but I’m forgetting something.

I’ve tested the same algorithm with the same matrices in MATLAB and everthing is correct.
I show you a very simplified code, with a very basic kernel for clarity, its output and the expected output. What am I doing wrong? Thank you very much in advance.

const int W = 5;
const int H = 4;

//Signal, with just one value, for simplicity.
float A[H][W] = {{0,0,0,0,0},
{0,1,0,0,0},
{0,0,0,0,0},
{0,0,0,0,0}};

//Central element of the kernel in the (0,0) position of the array.
float B[H][W] = {{0.5, 0.1,  0,    0,  0.2},
{0  ,  0  ,  0,    0,  0},
{0  ,  0  ,  0,    0,  0},
{0  ,  0  ,  0,    0,  0}};

cufftReal* d_inA, *d_inB;
cufftComplex* d_outA, *d_outB;

size_t real_size = W * H * sizeof(cufftReal);
size_t complex_size = W * (H/2+1) * sizeof(cufftComplex);

cudaMalloc((void**)&d_inA, real_size);
cudaMalloc((void**)&d_inB, real_size);

cudaMalloc((void**)&d_outA, complex_size);
cudaMalloc((void**)&d_outB, complex_size);

cudaMemset(d_inA,0,real_size);
cudaMemset(d_inB,0,real_size);

cudaMemcpy(d_inA, A, real_size,	cudaMemcpyHostToDevice);
cudaMemcpy(d_inB, B, real_size,	cudaMemcpyHostToDevice);

cufftHandle fwplanA, fwplanB, bwplan;
cufftPlan2d(&fwplanA, W, H, CUFFT_R2C);
cufftPlan2d(&fwplanB, W, H, CUFFT_R2C);
cufftPlan2d(&bwplan, W, H, CUFFT_C2R);

cufftSetCompatibilityMode(fwplanA,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(fwplanB,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(bwplan,CUFFT_COMPATIBILITY_NATIVE);

cufftExecR2C(fwplanA, d_inA, d_outA);
cufftExecR2C(fwplanB, d_inB, d_outB);

int blocksx = ceil((W*(H/2+1 )) / 256.0f);
dim3 grid(blocksx);
// One complex product for each thread, scaled by the inverse of the
// number of elements involved in the FFT

cufftExecC2R(bwplan, d_outA, d_inA);

cufftReal* result = new cufftReal[W*2*(H/2+1)];
cudaMemcpy(result, d_inA, real_size,cudaMemcpyDeviceToHost);

// Print result...
// Free memory...

Output. Note the displaced value

-0.0	0.0	-0.0	-0.0	0.0
0.0	<b>0.5	0.1</b>	0.0	-0.0
<b>0.2</b>	0.0	0.0	-0.0	-0.0
-0.0	0.0	-0.0	-0.0	-0.0

Expected output (MATLAB)

0         0         0         0         0
0.2000    0.5000    0.1000    0.0000    0.0000
0         0         0         0         0
0         0         0         0         0

What do these lines do?

cufftSetCompatibilityMode(fwplanA,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(fwplanB,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(bwplan,CUFFT_COMPATIBILITY_NATIVE);

I am working with cuFFT library practically doing convolutions. I can share my code, but I do not use these compatbility modes, just in-pace transforms. For the in-place transforms one has to use a real array of size H*(W+2).
So far in your code the only problem I see is the plan formation. If you have a matrix A[H][W] the plans should be cufftPlan2d(&fwplanA,H,W, CUFFT_R2C); Matlab and C have different layouts for the matrices.

I can post here the relevant parts of my code , but they have no CompatibilityMode

I can’t believe this… That solves the problem in my 2D simple test, I didn’t know that the plan has to be H,W instead of W,H… I have to do some full tests with the whole application using 3D convolutions before confirming, but I’m almost sure that’s the matter. Whole weeks of debugging just for this… Thank you so, so much!

Correct me if I’m wrong, but using CUFFT_COMPATIBILITY_NATIVE you can avoid the extra needed in in-place FFT, achieving better coalescence, but you lose FFTW layout compatibility.
My simple code use out-place, so they are irrelevant, but I need in-place transforms in my application.

From cuFFT doc:

Thank you very much, again!