cuFFT gives a different answer on each run?

Hey, I’m running an algorithm that calls the cuFFT library multiple times on the same matrix (currently just for testing). However, I’m getting some weird results: cuFFT isn’t giving me the same answer on each run. Any help would be greatly appreciated. Here’s the code I’ve been using:

Main algorithm:

//-- width = height = 1024 in this algorithm.
//-- d_dose is a float array of size width*height
//-- d_doseDFT is a float2/cufftComplex array of size width*(height/2 + 1)
//-- d_kernelDFT is a float2/cufftComplex array of size width*(height/2 + 1)
//-- all 3 arrays were allocated using cudaMalloc so they are in device memory.
//-- R2CPlan was created using cufftPlan2d(&R2CPlan, width, height, CUFFT_R2C)

int NsqC = width*(height/2 + 1);
float2 tmp;

for (int iter=0; iter<5; iter++){
	// Convolution in frequency domain
	cudaMemset(d_doseDFT, 0, NsqC*sizeof(float2));
	tmp = sumComplex(d_doseDFT, NsqC);
	cout << "BEGIN d_doseDFT sum is:==============  " << tmp.x << "\t" << tmp.y << endl;

	cufftExecR2C(R2CPlan, d_dose, d_doseDFT);
	cudaThreadSynchronize();

	tmp = sumComplex(d_doseDFT, NsqC);
	cout << "END d_doseDFT sum is:==============  " << tmp.x << "\t" << tmp.y << endl;

	// Below statement is where things get weird, explained further in the forum post
-->	cuArrayMulC(d_doseDFT, d_kernelDFT, d_doseDFT, NsqC);
	cudaThreadSynchronize();
	cudaMemset(d_doseDFT, 0, NsqC*sizeof(float2));
}
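As a sanity check on the sizes used above, here is a minimal host-only sketch of the arithmetic. The helper names r2cOutputCount and blocksLaunched are hypothetical and not part of cuFFT or of the code in this post; they only restate the expressions width*(height/2 + 1) and 1 + size/128 that appear in it.

```cpp
#include <cstddef>

// Hypothetical helper: number of complex elements produced by a 2D
// real-to-complex transform of an nx-by-ny real input, i.e. nx*(ny/2 + 1).
constexpr std::size_t r2cOutputCount(std::size_t nx, std::size_t ny) {
    return nx * (ny / 2 + 1);
}

// Hypothetical helper: number of blocks given by the launch expression
// 1 + size/threadsPerBlock used in cuArrayMulC (integer division).
constexpr std::size_t blocksLaunched(std::size_t size, std::size_t threadsPerBlock) {
    return 1 + size / threadsPerBlock;
}

// For width = height = 1024: NsqC is 525312 complex elements.
static_assert(r2cOutputCount(1024, 1024) == 525312, "unexpected R2C output size");
// 525312 is an exact multiple of 128, so the launch uses 4105 blocks and the
// last block does no work because of the i < size guard in the kernel.
static_assert(blocksLaunched(525312, 128) == 4105, "unexpected block count");
```

With these numbers, d_doseDFT and d_kernelDFT each need NsqC*sizeof(float2) = 4202496 bytes, so it may be worth confirming that both cudaMalloc calls requested at least that much.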

Some of the functions above are user-defined; here are their definitions:

float2 sumComplex(float2 *d_array, int size){
	float2 *array = (float2*) malloc(size*sizeof(float2));
	cudaMemcpy(array, d_array, size*sizeof(float2), cudaMemcpyDeviceToHost);
	cudaThreadSynchronize();
	float2 total;
	total.x = 0; total.y = 0;
	for (int i = 0; i < size; i++){
		total.x += abs(array[i].x);
		total.y += abs(array[i].y);
	}
	free(array);
	return total;
}

void cuArrayMulC(float2 *d_matrix1, float2 *d_matrix2, float2 *d_output, int size){
	g_arrayMulC<<<1+size/128, 128>>>(d_matrix1, d_matrix2, d_output, size);
}

__global__ void g_arrayMulC(float2 *matrix1, float2 *matrix2, float2 *output, int size){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if ( i < size){
		float2 t;
		t.x = matrix1[i].x * matrix2[i].x - matrix1[i].y * matrix2[i].y;
		t.y = matrix1[i].x * matrix2[i].y + matrix1[i].y * matrix2[i].x;
		output[i] = t;
	}
}
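To rule out the multiplication kernel itself, its output could be checked against a CPU reference. The following is a minimal host-only sketch under stated assumptions: HostComplex is a hypothetical stand-in for float2, and referenceMulC is a hypothetical name. It uses the same formula as g_arrayMulC, but the inputs would first have to be copied back with cudaMemcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for CUDA's float2 / cufftComplex.
struct HostComplex {
    float x;  // real part
    float y;  // imaginary part
};

// Element-wise complex product out[i] = a[i] * b[i], computed on the CPU
// with the same formula as the g_arrayMulC kernel.
std::vector<HostComplex> referenceMulC(const std::vector<HostComplex>& a,
                                       const std::vector<HostComplex>& b) {
    assert(a.size() == b.size());
    std::vector<HostComplex> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i].x = a[i].x * b[i].x - a[i].y * b[i].y;
        out[i].y = a[i].x * b[i].y + a[i].y * b[i].x;
    }
    return out;
}
```

For example, multiplying (1 + 2i) by (3 + 4i) with this function gives (-5 + 10i), which can be compared with the corresponding element of the GPU result.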

Here are the results I encountered. The print statement with “BEGIN d_doseDFT” always gives a sum of 0, which is to be expected. The sum printed by the “END d_doseDFT” statement, however, is different on each iteration of the for loop, which makes absolutely no sense given that the input is the same every time. That is why I say cuFFT is giving different answers each run.
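A summed magnitude is a fairly coarse checksum, so an element-wise comparison between iterations might show where the outputs start to differ. The sketch below is host-only and makes several assumptions: HostComplex stands in for float2, compareRuns and DiffReport are hypothetical names, and both arrays are assumed to have been copied to the host already (for example, the output of iteration 0 and the output of a later iteration).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for CUDA's float2 / cufftComplex.
struct HostComplex {
    float x;  // real part
    float y;  // imaginary part
};

// Summary of an element-wise comparison between two host copies.
struct DiffReport {
    std::size_t firstMismatch;  // first index differing by more than tol; equals size if none
    float maxAbsDiff;           // largest absolute difference in either component
};

// Compare a reference run against the current run, element by element.
DiffReport compareRuns(const std::vector<HostComplex>& ref,
                       const std::vector<HostComplex>& cur,
                       float tol) {
    assert(ref.size() == cur.size());
    DiffReport report{ref.size(), 0.0f};
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float dx = std::fabs(ref[i].x - cur[i].x);
        const float dy = std::fabs(ref[i].y - cur[i].y);
        const float d = dx > dy ? dx : dy;
        if (d > report.maxAbsDiff) report.maxAbsDiff = d;
        if (d > tol && report.firstMismatch == ref.size()) report.firstMismatch = i;
    }
    return report;
}
```

If firstMismatch comes back equal to the array size, the two runs agree within tol. Otherwise the index shows whether the differences start at the beginning of the buffer or only in one region of it.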

The part that gets weirder is the line I’ve marked with ‘-->’ in the algorithm. If I comment out that line, then the “END d_doseDFT” value IS the same on every iteration of the for loop. I don’t see how that line could affect the outcome of the cuFFT call at all, considering d_doseDFT is memset to 0 before the cuFFT call anyway.

Any ideas? I have no idea what’s going on.