CUFFT issue

I’ve been stuck on a CUFFT problem for a while now. It took me ages to track down, but it seems that the forward FFT is occasionally giving me a bad result. I’m not sure whether this is a hardware error, a bug in the CUFFT library, or just my misuse of the library. The code below should reproduce the problem, which shows up after doing successive FFTs.

[codebox]#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#include <cutil_inline.h>
#include <cufft.h>

int main(void)
{
    int n = 1 << 22;   // total number of complex samples
    int nx = 1 << 12;  // batch count; each transform has n/nx = 1024 points
    int g = 1000;      // number of test loops
    int infd = 0;      // set when an inf magnitude is found
    FILE *fp;
    float pf[4] = {-3, -1, 1, 3};  // possible input sample values

    float2 *d_v = 0;
    cutilSafeCall(cudaMalloc((void **)&d_v, n*sizeof(float2)));

    float2 *h_v = (float2 *)malloc(n*sizeof(float2));
    float *h_w = (float *)malloc(n*sizeof(float));
    if (h_v == NULL || h_w == NULL)
    {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }

    cufftHandle plan;
    cufftSafeCall(cufftPlan1d(&plan, n/nx, CUFFT_C2C, nx));

    if ((fp = fopen("ffterror", "w")) == NULL)
    {
        fprintf(stderr, "fopen failed\n");
        return 1;
    }

    for (int p = 0; p < g; p++)
    {
        printf("\b\b\b\b\b\b\b\b\b\b");
        printf("Loop: %d", p+1);
        fflush(stdout);

        // Fill the input with random real samples
        for (int i = 0; i < n; i++)
        {
            h_v[i].x = pf[rand()%4];
            h_v[i].y = 0;
        }

        // Copy to the device, run the in-place forward FFT, copy back
        cutilSafeCall(cudaMemcpy(d_v, h_v, n*sizeof(float2), cudaMemcpyHostToDevice));
        cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_v, (cufftComplex *)d_v, CUFFT_FORWARD));
        cutilSafeCall(cudaMemcpy(h_v, d_v, n*sizeof(float2), cudaMemcpyDeviceToHost));

        // Magnitude of every output value; inf marks a bad result
        for (int i = 0; i < n; i++)
        {
            h_w[i] = sqrtf(h_v[i].x*h_v[i].x + h_v[i].y*h_v[i].y);
            if (isinf(h_w[i]))
                infd = 1;
        }

        // Dump the whole result of the failing loop and stop
        if (infd)
        {
            for (int i = 0; i < n; i++)
            {
                fprintf(fp, "%f\t%f\t%f\n", h_v[i].x, h_v[i].y, h_w[i]);
            }
            fclose(fp);
            printf("\ninf detected\n");
            return 1;
        }
    }

    printf("\n");

    fclose(fp);
    cufftDestroy(plan);
    cudaFree(d_v);
    free(h_v);
    free(h_w);

    return 0;
}[/codebox]

To test for a bad result I’m taking the absolute value of each complex number. It’s just luck that some of the wrong numbers are large enough to make the absolute value ‘inf’, which makes the errors easy to detect. I’ve never had a problem with the FFT in the first loop (although I can’t say it will always be correct), but I have had a bad result as early as the third loop. The output file contains the complex values and the absolute value of the FFT from the loop in which an ‘inf’ was detected. If I search for that inf I see something like this:

[codebox]…
0.822151 5.821064 5.878837
-14.281403 -35.493328 38.258789
28.915157 74.496826 79.911598
-68.777946 -34.812363 77.086357
8624051719469774914901966097219584.000000 102.752304 inf
-28.906429 -13.132529 31.749723
68.554565 37.719254 78.246216
38.730705 -39.402470 55.250538
-20.291538 28.090515 34.652901
…[/codebox]
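The inf test only fires when the squared magnitude overflows a float, which means a value has to be above roughly 1.8e19 before it is noticed. A stricter check is possible because the inputs are at most 3 in magnitude and each transform has n/nx = 1024 points, so no component of a correct (unnormalized) forward FFT output should exceed 3 × 1024 = 3072. The following is only a sketch of that idea: `complexf` is a stand-in for CUDA’s `float2` so that it compiles as plain C, and `count_out_of_range` is a hypothetical helper name, not part of CUFFT.

```c
#include <math.h>
#include <stddef.h>

/* Stand-in for CUDA's float2 so the sketch compiles as plain C. */
typedef struct { float x, y; } complexf;

/* Hypothetical helper (not a CUFFT function): counts the elements whose
   real or imaginary part is NaN, infinite, or larger in magnitude than
   `bound`. If first_bad is non-NULL and at least one offender exists,
   the index of the first offender is stored there. */
size_t count_out_of_range(const complexf *v, size_t count, float bound,
                          size_t *first_bad)
{
    size_t bad = 0;
    for (size_t i = 0; i < count; i++) {
        /* The negated comparison is also true for NaN. */
        if (!(fabsf(v[i].x) <= bound) || !(fabsf(v[i].y) <= bound)) {
            if (bad == 0 && first_bad != NULL)
                *first_bad = i;
            bad++;
        }
    }
    return bad;
}
```

In the test loop this would be called on `h_v` after the device-to-host copy, with a bound slightly above 3072 (for example `3072.0f * 1.01f`) to leave room for rounding error. The 1% margin is an arbitrary choice for illustration.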

I don’t think 8.6241e+33 is a good result, and I always seem to get an error like this within a couple of hundred loops (usually much earlier). I also have a few questions about the CUFFT library that I haven’t been able to find answers to.

I read that there is a maximum transform size of 8 million. Is that actually 8000000 or 2^23 = 8388608?

Is there a maximum number of batches for the transform?

Here’s the output from deviceQuery in case it’s any help:

[codebox]CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "GeForce 9800 GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.50 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…[/codebox]

If anyone can see what I’m doing wrong or offer me any advice, I’d be very grateful. I’m hoping to try this code on another card at some stage to see if the problem still exists.
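Before moving to another card, one experiment that might help separate a hardware fault from a library problem is to run the forward FFT twice on the same input and compare the two results bit for bit. This assumes that CUFFT returns identical output for identical input when the same plan is reused on the same device. I have not verified that assumption. If it holds, a mismatch between the two runs would point towards a transient fault (memory, heat, clocks) rather than a deterministic bug. The sketch below only shows the comparison step: `complexf` again stands in for `float2`, and `first_mismatch` is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for CUDA's float2 so the sketch compiles as plain C. */
typedef struct { float x, y; } complexf;

/* Hypothetical helper: returns the index of the first element whose bit
   pattern differs between the two result buffers, or `count` if the
   buffers are identical. A bitwise comparison is used so that NaN values
   are compared by representation rather than by ==. */
size_t first_mismatch(const complexf *a, const complexf *b, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (memcmp(&a[i], &b[i], sizeof(complexf)) != 0)
            return i;
    }
    return count;
}
```

To use it, the input would be kept in a second host buffer, uploaded and transformed twice, and each result copied back into its own host array before calling `first_mismatch(run1, run2, n)`.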