Hello,

I’m trying to perform a 2D convolution using the “FFT + point_wise_product + iFFT” aproach. Using NxN matrices the method goes well, however, with non square matrices the results are not correct. I’ve read the whole cuFFT documentation looking for any note about the behavior with this kind of matrices, tested in-place and out-place FFT, but I’m forgetting something.

I’ve tested the same algorithm with the same matrices in MATLAB and everthing is correct.

I show you a very simplified code, with a very basic kernel for clarity, its output and the expected output. What am I doing wrong? Thank you very much in advance.

```
const int W = 5;
const int H = 4;
//Signal, with just one value, for simplicity.
float A[H][W] = {{0,0,0,0,0},
{0,1,0,0,0},
{0,0,0,0,0},
{0,0,0,0,0}};
//Central element of the kernel in the (0,0) position of the array.
float B[H][W] = {{0.5, 0.1, 0, 0, 0.2},
{0 , 0 , 0, 0, 0},
{0 , 0 , 0, 0, 0},
{0 , 0 , 0, 0, 0}};
cufftReal* d_inA, *d_inB;
cufftComplex* d_outA, *d_outB;
size_t real_size = W * H * sizeof(cufftReal);
size_t complex_size = W * (H/2+1) * sizeof(cufftComplex);
cudaMalloc((void**)&d_inA, real_size);
cudaMalloc((void**)&d_inB, real_size);
cudaMalloc((void**)&d_outA, complex_size);
cudaMalloc((void**)&d_outB, complex_size);
cudaMemset(d_inA,0,real_size);
cudaMemset(d_inB,0,real_size);
cudaMemcpy(d_inA, A, real_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_inB, B, real_size, cudaMemcpyHostToDevice);
cufftHandle fwplanA, fwplanB, bwplan;
cufftPlan2d(&fwplanA, W, H, CUFFT_R2C);
cufftPlan2d(&fwplanB, W, H, CUFFT_R2C);
cufftPlan2d(&bwplan, W, H, CUFFT_C2R);
cufftSetCompatibilityMode(fwplanA,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(fwplanB,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(bwplan,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(fwplanA, d_inA, d_outA);
cufftExecR2C(fwplanB, d_inB, d_outB);
int blocksx = ceil((W*(H/2+1 )) / 256.0f);
dim3 threads(256);
dim3 grid(blocksx);
// One complex product for each thread, scaled by the inverse of the
// number of elements involved in the FFT
pointwise_product<<<grid, threads>>>(d_outA, d_outB, (W*(H/2+1)), 1.0f/(W*H));
cufftExecC2R(bwplan, d_outA, d_inA);
cufftReal* result = new cufftReal[W*2*(H/2+1)];
cudaMemcpy(result, d_inA, real_size,cudaMemcpyDeviceToHost);
// Print result...
// Free memory...
```

Output. Note the displaced value

```
-0.0 0.0 -0.0 -0.0 0.0
0.0 <b>0.5 0.1</b> 0.0 -0.0
<b>0.2</b> 0.0 0.0 -0.0 -0.0
-0.0 0.0 -0.0 -0.0 -0.0
```

Expected output (MATLAB)

```
0 0 0 0 0
0.2000 0.5000 0.1000 0.0000 0.0000
0 0 0 0 0
0 0 0 0 0
```