cuFFT 2D Convolution

jgonzac · April 3, 2014, 1:55pm

Hello,
I’m trying to perform a 2D convolution using the “FFT + point_wise_product + iFFT” aproach. Using NxN matrices the method goes well, however, with non square matrices the results are not correct. I’ve read the whole cuFFT documentation looking for any note about the behavior with this kind of matrices, tested in-place and out-place FFT, but I’m forgetting something.

I’ve tested the same algorithm with the same matrices in MATLAB and everthing is correct.
I show you a very simplified code, with a very basic kernel for clarity, its output and the expected output. What am I doing wrong? Thank you very much in advance.

const int W = 5;
const int H = 4;

//Signal, with just one value, for simplicity.
float A[H][W] = {{0,0,0,0,0},
		{0,1,0,0,0},
		{0,0,0,0,0},
		{0,0,0,0,0}};

//Central element of the kernel in the (0,0) position of the array.
float B[H][W] = {{0.5, 0.1,  0,    0,  0.2},
		{0  ,  0  ,  0,    0,  0},
		{0  ,  0  ,  0,    0,  0},
		{0  ,  0  ,  0,    0,  0}};


cufftReal* d_inA, *d_inB;
cufftComplex* d_outA, *d_outB;

size_t real_size = W * H * sizeof(cufftReal);
size_t complex_size = W * (H/2+1) * sizeof(cufftComplex);

cudaMalloc((void**)&d_inA, real_size);
cudaMalloc((void**)&d_inB, real_size);

cudaMalloc((void**)&d_outA, complex_size);
cudaMalloc((void**)&d_outB, complex_size);

cudaMemset(d_inA,0,real_size);
cudaMemset(d_inB,0,real_size);

cudaMemcpy(d_inA, A, real_size,	cudaMemcpyHostToDevice);
cudaMemcpy(d_inB, B, real_size,	cudaMemcpyHostToDevice);


cufftHandle fwplanA, fwplanB, bwplan;
cufftPlan2d(&fwplanA, W, H, CUFFT_R2C);
cufftPlan2d(&fwplanB, W, H, CUFFT_R2C);
cufftPlan2d(&bwplan, W, H, CUFFT_C2R);

cufftSetCompatibilityMode(fwplanA,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(fwplanB,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(bwplan,CUFFT_COMPATIBILITY_NATIVE);	

cufftExecR2C(fwplanA, d_inA, d_outA);
cufftExecR2C(fwplanB, d_inB, d_outB);

int blocksx = ceil((W*(H/2+1 )) / 256.0f);
dim3 threads(256);
dim3 grid(blocksx);
// One complex product for each thread, scaled by the inverse of the
// number of elements involved in the FFT
pointwise_product<<<grid, threads>>>(d_outA, d_outB, (W*(H/2+1)), 1.0f/(W*H));

cufftExecC2R(bwplan, d_outA, d_inA);


cufftReal* result = new cufftReal[W*2*(H/2+1)];
cudaMemcpy(result, d_inA, real_size,cudaMemcpyDeviceToHost);

// Print result...
// Free memory...

Output. Note the displaced value

-0.0	0.0	-0.0	-0.0	0.0	
0.0	<b>0.5	0.1</b>	0.0	-0.0	
<b>0.2</b>	0.0	0.0	-0.0	-0.0	
-0.0	0.0	-0.0	-0.0	-0.0

Expected output (MATLAB)

0         0         0         0         0
0.2000    0.5000    0.1000    0.0000    0.0000
0         0         0         0         0
0         0         0         0         0

pasoleatis · April 3, 2014, 2:54pm

What do these lines do?

cufftSetCompatibilityMode(fwplanA,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(fwplanB,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(bwplan,CUFFT_COMPATIBILITY_NATIVE);

I am working with cuFFT library practically doing convolutions. I can share my code, but I do not use these compatbility modes, just in-pace transforms. For the in-place transforms one has to use a real array of size H*(W+2).
So far in your code the only problem I see is the plan formation. If you have a matrix A[H][W] the plans should be cufftPlan2d(&fwplanA,H,W, CUFFT_R2C); Matlab and C have different layouts for the matrices.

I can post here the relevant parts of my code , but they have no CompatibilityMode

jgonzac · April 3, 2014, 3:45pm

I can’t believe this… That solves the problem in my 2D simple test, I didn’t know that the plan has to be H,W instead of W,H… I have to do some full tests with the whole application using 3D convolutions before confirming, but I’m almost sure that’s the matter. Whole weeks of debugging just for this… Thank you so, so much!

Correct me if I’m wrong, but using CUFFT_COMPATIBILITY_NATIVE you can avoid the extra needed in in-place FFT, achieving better coalescence, but you lose FFTW layout compatibility.
My simple code use out-place, so they are irrelevant, but I need in-place transforms in my application.

From cuFFT doc:

Thank you very much, again!

Topic		Replies	Views
cufft questions CUDA Programming and Performance	5	1520	April 2, 2012
in-place FFT fails on 2D real transform CUDA Programming and Performance	0	848	August 13, 2009
CUDA FFT different from Matlab FFT CUDA Programming and Performance	32	9566	March 29, 2011
Periodically discrepancies between Cufft and FFTW - output GPU-Accelerated Libraries	1	1588	January 5, 2013
Getting Segmentation fault for higher sized input arrays for 2d CuFFT application. CUDA Programming and Performance	15	3019	July 10, 2014
Buggy CUFFT inplace 2D R2C? CUDA Programming and Performance	0	3774	February 20, 2008
cuFFT and FFTW ciscrepancies GPU-Accelerated Libraries	1	1371	February 5, 2013
Why CUFFT_COMPATIBILITY_FFTW_ALL? Wrong behavior with 2D batched FT CUDA Programming and Performance	1	814	December 2, 2011
cuFFT and fftw CUDA Programming and Performance	10	4288	August 25, 2010
Differences between cufft plan2d matlab fft2? CUDA Programming and Performance	2	3005	July 20, 2009

cuFFT 2D Convolution

Related topics