cuFFT R2C 2D Batch Transforms Producing Incorrect Results

Hello,

When using the cuFFT library to perform 2D convolutions, I am running into several problems: I only get the expected results when I pass what I believe are *incorrect* values for idist and odist to the cufftPlanMany call that creates the R2C plan. When I use plans from cufftPlan2d instead, the results are also incorrect.

I have written the sample code shown below, in which I attempt to batch-FFT two 2D inputs, perform an element-wise multiplication of the two frequency-domain results, and then IFFT the product to obtain the 2D convolution in the spatial domain. However, I did not get the correct numbers when batching the FFTs. I eventually found a configuration that does work, but at the cost of not being able to batch the FFTs. The code below is currently in that working configuration, and I have commented the lines to change in order to reproduce the problem. The bullets below summarize the additional troubleshooting steps I have taken:

• When using the 2D batch plan with batch=1, it works correctly if idist and odist are set to the number of elements in one *row* of the input/output memory (in their respective units: for the R2C plan, idist = 2 * odist because the input is cufftReal and the output is cufftComplex; for the C2R plan it is the opposite). Of course, with batch set to 1 I cannot actually batch anything.

• When using the same configuration as above with batch=2, the second FFT does not appear to happen at all. This is understandable, because those idist and odist values are incorrect for batching, so there could be race conditions when the results are written. I did not verify whether the first FFT's data was correct, since the second FFT was already broken.

• When using the 2D batch plan with batch=2 and what I believe are the *correct* values for idist and odist (the total number of elements per transform in their respective units, so again idist = 2 * odist for the R2C plan and the opposite for C2R), it does not function correctly. This is the case I expected to work from the start.

• When using non-batch plans created by cufftPlan2d for R2C and C2R, the output is also incorrect, and it differs from the previous case. This should have worked as well, so I was surprised when it didn't.

Is there truly something wrong with the cuFFT library, or am I just creating my plans incorrectly? If it's the latter, I would appreciate it if someone could return the code below modified so that the batched R2C FFTs are used and produce the correct numbers. Going through the manual, it seems that what I understand to be the correct way of doing things does not actually produce the correct numbers.

Thanks for your help!

Chris

EDIT: I’m using Toolkit v5.0, 64-bit, developing on Windows 7 with a Quadro K2000M.

void CUDAFunctions::TestCuFFT(void)
	{
		//dimensions that force FFT size to be 16x16, which is power-of-two
		int a_d0 = 9;
		int a_d1 = 9;
		int b_d0 = 8;
		int b_d1 = 8;
		int fft_d0 = a_d0 + b_d0 - 1;
		int fft_d1 = a_d1 + b_d1 - 1;
		int half_d1_plus_one = fft_d1 / 2 + 1;

		//allocate host side
		float *A = (float*)malloc(a_d0 * a_d1 * sizeof(float));
		float *B = (float*)malloc(b_d0 * b_d1 * sizeof(float));
		float *result = (float*)malloc(fft_d0 * fft_d1 * sizeof(float));
		cufftComplex *debug = (cufftComplex*)malloc(fft_d0 * half_d1_plus_one * sizeof(cufftComplex));

		//allocate GPU and memset for zero-padding
		cufftComplex *d_A;
		cufftComplex *d_B;
		cufftComplex *d_A_result;
		cufftComplex *d_B_result;
		cufftComplex *d_convolution;
		cudaMalloc(&d_A, 2 * fft_d0 * half_d1_plus_one * sizeof(cufftComplex));
		d_B = &d_A[fft_d0 * half_d1_plus_one];
		cudaMalloc(&d_A_result, 2 * fft_d0 * half_d1_plus_one * sizeof(cufftComplex));
		d_B_result = &d_A_result[fft_d0 * half_d1_plus_one];
		cudaMalloc(&d_convolution, fft_d0 * half_d1_plus_one * sizeof(cufftComplex));
		cudaMemset(d_B, 0, fft_d0 * half_d1_plus_one * sizeof(cufftComplex));
		cudaMemset(d_A, 0, fft_d0 * half_d1_plus_one * sizeof(cufftComplex));
		cudaMemset(d_A_result, 0, fft_d0 * half_d1_plus_one * sizeof(cufftComplex));
		cudaMemset(d_B_result, 0, fft_d0 * half_d1_plus_one * sizeof(cufftComplex));
		cudaMemset(d_convolution, 0, fft_d0 * half_d1_plus_one * sizeof(cufftComplex));

		//create single plans
		cufftHandle plan_forward_single;
		cufftPlan2d(&plan_forward_single, fft_d0, fft_d1, CUFFT_R2C);
		cufftHandle plan_reverse_single;
		cufftPlan2d(&plan_reverse_single, fft_d0, fft_d1, CUFFT_C2R);

		//create batch plans
		int n[2] = {fft_d0, fft_d1};
		int inembed[2] = {fft_d0, half_d1_plus_one * sizeof(cufftComplex) / sizeof(cufftReal)};
		int onembed[2] = {fft_d0, half_d1_plus_one};
		cufftHandle plan_forward_many;
		cufftPlanMany(&plan_forward_many, 2, n,
			inembed, 1, half_d1_plus_one * sizeof(cufftComplex) / sizeof(cufftReal), //<--NOT MULTIPLYING idist BY fft_d0 IS INCORRECT FOR BATCHING BUT YIELDS CORRECT VALUES!!!
			onembed, 1, half_d1_plus_one, //<--NOT MULTIPLYING odist BY fft_d0 IS INCORRECT FOR BATCHING BUT YIELDS CORRECT VALUES!!!
			CUFFT_R2C, 1); //<--CHANGE TO 2 IF BATCHING IS DESIRED
		cufftHandle plan_reverse_many;
		inembed[1] = half_d1_plus_one;
		onembed[1] = half_d1_plus_one * sizeof(cufftComplex) / sizeof(cufftReal);
		cufftPlanMany(&plan_reverse_many, 2, n,
			inembed, 1, half_d1_plus_one, //<--NOT MULTIPLYING idist BY fft_d0 IS INCORRECT FOR BATCHING BUT YIELDS CORRECT VALUES!!!
			onembed, 1, half_d1_plus_one * sizeof(cufftComplex) / sizeof(cufftReal), //<--NOT MULTIPLYING odist BY fft_d0 IS INCORRECT FOR BATCHING BUT YIELDS CORRECT VALUES!!!
			CUFFT_C2R, 1);

		//just in case...
		cufftSetCompatibilityMode(plan_forward_single, CUFFT_COMPATIBILITY_NATIVE);
		cufftSetCompatibilityMode(plan_reverse_single, CUFFT_COMPATIBILITY_NATIVE);
		cufftSetCompatibilityMode(plan_forward_many, CUFFT_COMPATIBILITY_NATIVE);
		cufftSetCompatibilityMode(plan_reverse_many, CUFFT_COMPATIBILITY_NATIVE);

		//fill A and B with ones, then copy to the GPU; the cudaMemsets from before provide the zero-padding
		for(int i = 0; i < (a_d0 * a_d1); ++i)
			A[i] = 1.0f;
		for(int i = 0; i < (b_d0 * b_d1); ++i)
			B[i] = 1.0f;
		cudaMemcpy2D(d_A, half_d1_plus_one * sizeof(cufftComplex), A, a_d1 * sizeof(float), a_d1 * sizeof(float), a_d0, cudaMemcpyHostToDevice);
		cudaMemcpy2D(d_B, half_d1_plus_one * sizeof(cufftComplex), B, b_d1 * sizeof(float), b_d1 * sizeof(float), b_d0, cudaMemcpyHostToDevice);

		//do FFT
		cufftExecR2C(plan_forward_many, (cufftReal*)d_A, d_A_result);
		cufftExecR2C(plan_forward_many, (cufftReal*)d_B, d_B_result); //<--COMMENT THIS LINE IF BATCHING IS DESIRED
		
		//copy to host and dump (compare to MATLAB output for check of FFT results)
		cudaMemcpy(debug, d_A_result, fft_d0 * half_d1_plus_one * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
		for(int i = 0; i < (fft_d0 * half_d1_plus_one); ++i)
			printf("Element %i of F(A) (single plan) is %f + %f i\n", i, debug[i].x, debug[i].y);
		cudaMemcpy(debug, d_B_result, fft_d0 * half_d1_plus_one * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
		for(int i = 0; i < (fft_d0 * half_d1_plus_one); ++i)
			printf("Element %i of F(B) (single plan) is %f + %f i\n", i, debug[i].x, debug[i].y);

		//perform convolution in frequency domain; this also normalizes the fft's
		dim3 block = dim3(16, 16, 1);
		dim3 grid = dim3((fft_d0 + block.x - 1) / block.x, (fft_d1 + block.y - 1) / block.y, 1);
		cuElementWiseMultiplication<<<grid, block>>>(d_A_result, d_B_result, d_convolution, fft_d0, fft_d1);
		cudaDeviceSynchronize();

		//do IFFT
		cufftExecC2R(plan_reverse_many, d_convolution, (cufftReal*)d_convolution);

		//copy result (compare this with MATLAB but convolving matrices of all one's makes it easy to tell if it worked; first and last 16 elements should be 1,2,3,4,5,6,7,8,8,7,6,5,4,3,2,1)
		cudaMemcpy2D(result, fft_d1 * sizeof(float), d_convolution, half_d1_plus_one * sizeof(cufftComplex), fft_d1 * sizeof(float), fft_d0, cudaMemcpyDeviceToHost);
		for(int i = 0; i < (fft_d0 * fft_d1); ++i)
			printf("Element %i of result is %f\n", i, result[i]);

		//de-allocate everything
		free(A);
		free(B);
		free(result);
		free(debug);
		cudaFree(d_A);
		cudaFree(d_B);
		cudaFree(d_A_result);
		cudaFree(d_B_result);
		cudaFree(d_convolution);
		cufftDestroy(plan_forward_many);
		cufftDestroy(plan_reverse_many);
		cufftDestroy(plan_forward_single);
		cufftDestroy(plan_reverse_single);
	}

	__global__ void cuElementWiseMultiplication(cufftComplex *A, cufftComplex *B, cufftComplex *C, int dim0, int dim1)
	{
		int tidx = threadIdx.x + blockIdx.x * blockDim.x;
		int tidy = threadIdx.y + blockIdx.y * blockDim.y;
		int idx = tidx + tidy * dim1;
		//the R2C output only holds dim0 * (dim1 / 2 + 1) complex elements, so also guard the linear index
		if(tidx >= dim1 || tidy >= dim0 || idx >= dim0 * (dim1 / 2 + 1)){return;}
		cufftComplex a = A[idx];
		cufftComplex b = B[idx];
		cufftComplex c = make_float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
		float numElements = (float)(dim0 * dim1);
		c.x /= numElements; //normalize
		c.y /= numElements; //normalize
		C[idx] = c;
	}

Does this problem still persist in CUDA 5.5 RC? If it does, could you please file a bug report?

Login at https://developer.nvidia.com/user/login
Click link “CUDA/GPU Computing Registered Developer Program”
Click link “The Submit a Bug Form”