cufftExecC2C incorrect for certain FFT sizes

First off - I apologize that my first post has to be a question. I visit the forums frequently but have come across an issue that has me scratching my head.

I have a large CUDA application and at one point it calculates the inverse FFT for a set of data. Comparing this output to FFTW (for example) produces drastically different results, but ONLY for an FFT size of 32k. Unfortunately I cannot post the entire code, but here’s the gist:

#define FFT_SIZE 32768

cufftComplex *input = (cufftComplex *)malloc(sizeof(cufftComplex) * FFT_SIZE);
cufftComplex *input_d;
cufftHandle plan;

cufftPlan1d(&plan, FFT_SIZE, CUFFT_C2C, 1);
cudaMalloc((void **)&input_d, sizeof(cufftComplex) * FFT_SIZE);

// Generate the input data
cudaMemcpy(input_d, input, sizeof(cufftComplex) * FFT_SIZE, cudaMemcpyHostToDevice);

cufftExecC2C(plan, input_d, input_d, CUFFT_FORWARD);
// Verified data looks good here.

int nThreads = 512;
int nBlocks = (FFT_SIZE / 2) / nThreads + (((FFT_SIZE / 2) % nThreads) == 0 ? 0 : 1);
kernelFunc<<< nBlocks, nThreads >>>(FFT_SIZE, input_d);
// Verified data looks good here.

cufftExecC2C(plan, input_d, input_d, CUFFT_INVERSE);
// Data is completely wrong here.

In my main application, if I change FFT_SIZE to 1k, 2k, 4k, 8k, or 16k, the output after the inverse FFT is perfect. It's only for an FFT_SIZE of 32k that the data is wrong.

I should also add that I tried this sample code outside of my environment and it worked for all FFT sizes. So there seems to be something within my main application environment that is causing the error. I’ve been looking at this for a few days and am clueless.

If anyone has any thoughts I would greatly appreciate it!

The error is probably in the nBlocks assignment or in the kernelFunc.
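For reference, the nBlocks expression in the question is an integer ceiling division, so on its own it gives enough threads for FFT_SIZE / 2 elements. Whether FFT_SIZE / 2 is the right element count depends on what kernelFunc does, which the thread does not show. Below is a minimal host-side C++ sketch of that arithmetic; the helper names blocks_for and covers_all are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of blocks such that
// blocks * threads_per_block >= n_elements (integer ceiling division).
// Same result as the "n / t + (n % t == 0 ? 0 : 1)" form in the question.
inline std::size_t blocks_for(std::size_t n_elements, std::size_t threads_per_block)
{
    assert(threads_per_block > 0);
    return (n_elements + threads_per_block - 1) / threads_per_block;
}

// Hypothetical check: true when the launch provides at least one thread
// per element. Surplus threads still have to be masked inside the kernel
// with a guard such as "if (idx < n)".
inline bool covers_all(std::size_t n_elements, std::size_t blocks,
                       std::size_t threads_per_block)
{
    return blocks * threads_per_block >= n_elements;
}
```

With the values from the question, blocks_for(32768 / 2, 512) is 32, which covers 16384 elements but not all 32768. So the launch is only correct if kernelFunc is meant to touch half the buffer or to process two elements per thread.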

Thanks mfatica, I have a feeling you’re right and I was really hoping that wasn’t the issue :)

I quickly commented out the first cufftExecC2C and the kernelFunc and that didn’t seem to help. I have 4 other kernel functions above that first cufftExecC2C function, so it looks like I’ve got a bit more debugging to do.

Thanks for the direction.

Is there a reason for using FFT_SIZE/2 when setting up the kernel launch configuration? You could comment out the kernelFunc launch and see whether an FFT followed by an IFFT reproduces the original input (scaling may be needed).
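On the scaling remark: cuFFT transforms are unnormalized, so a forward transform followed by an inverse transform returns the input multiplied by the transform length. The sketch below illustrates that convention on the host with a naive O(n^2) DFT rather than cuFFT itself, so it runs without a GPU. The function names naive_dft and max_roundtrip_error are hypothetical, and the sign argument mirrors the exponent sign of CUFFT_FORWARD (-1) and CUFFT_INVERSE (+1).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Naive unnormalized DFT, for illustration only (not how cuFFT computes it).
// sign = -1 is the forward transform, sign = +1 the inverse; no 1/n factor.
std::vector<cfloat> naive_dft(const std::vector<cfloat>& in, int sign)
{
    assert(sign == -1 || sign == 1);
    const std::size_t n = in.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cfloat> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);  // accumulate in double
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * two_pi *
                static_cast<double>((j * k) % n) / static_cast<double>(n);
            acc += std::complex<double>(in[j].real(), in[j].imag()) *
                   std::polar(1.0, angle);
        }
        out[k] = cfloat(static_cast<float>(acc.real()),
                        static_cast<float>(acc.imag()));
    }
    return out;
}

// Largest absolute difference between round_trip / n and the original input.
float max_roundtrip_error(const std::vector<cfloat>& original,
                          const std::vector<cfloat>& round_trip)
{
    assert(original.size() == round_trip.size());
    const float scale = 1.0f / static_cast<float>(original.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < original.size(); ++i) {
        worst = std::max(worst, std::abs(round_trip[i] * scale - original[i]));
    }
    return worst;
}
```

Calling naive_dft(naive_dft(x, -1), +1) and passing the result to max_roundtrip_error should give a value near floating-point rounding error. The same check applied to the device buffer after the two cufftExecC2C calls, with kernelFunc commented out, would show whether the transforms themselves round-trip correctly.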

You could also use ArrayFire (which is free) to do FFTs in one line as shown below:

   array A = randu(3, 4); // random data generation on GPU
   array B = fft(A);      // FFT

A =

        0.7402     0.9690     0.6673     0.5132

        0.9210     0.9251     0.1099     0.7762

        0.0390     0.4464     0.4702     0.2948

B =

        1.7002 +    0.0000i    2.3405 +    0.0000i    1.2475 +    0.0000i    1.5841 +    0.0000i

        0.2602 -    0.7638i    0.2832 -    0.4146i    0.3772 +    0.3120i   -0.0223 -    0.4169i

        0.2602 +    0.7638i    0.2832 +    0.4146i    0.3772 -    0.3120i   -0.0223 +    0.4169i

Thanks short. I’ve stepped away from this for a few days to help clear my head. I’ll give ArrayFire a shot.

Commenting out the kernelFunc did not seem to help. However, commenting out the first cufftExecC2C DID in fact produce the correct results (contrary to what I said in my second post; I had made some other changes since then). I'm not sure why that is the case. I've tried copying the data to different buffers, all without success.

I’m starting to go back into the code above all of this. Unfortunately, each function depends on the output from the previous one! So trying to get meaningful data down to this point has been challenging.

Thanks again for the help.

Do you have to set the cuFFT compatibility mode to CUFFT_COMPATIBILITY_FFTW_ALL (or something like that), since you are checking correctness against FFTW?
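Separately from any compatibility setting, a single-precision cuFFT result is not expected to match an FFTW reference bit for bit, so the comparison is usually done with a tolerance. Here is a minimal sketch of one common metric, the relative L2 error, assuming the reference was computed in double precision. The function name relative_l2_error is hypothetical, and the acceptable threshold is left to the caller.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: ||test - ref||_2 / ||ref||_2.
// Falls back to the absolute error when the reference is all zeros.
double relative_l2_error(const std::vector<std::complex<float>>& test,
                         const std::vector<std::complex<double>>& ref)
{
    assert(test.size() == ref.size());
    double diff_sq = 0.0;
    double ref_sq = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const std::complex<double> t(test[i].real(), test[i].imag());
        diff_sq += std::norm(t - ref[i]);  // std::norm is |z|^2
        ref_sq += std::norm(ref[i]);
    }
    return ref_sq > 0.0 ? std::sqrt(diff_sq / ref_sq) : std::sqrt(diff_sq);
}
```

A result that is "completely wrong", as described in the question, would show up as a relative error on the order of 1 or larger, whereas ordinary single-precision rounding gives values many orders of magnitude smaller.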