CUFFT issue

cou202 · December 29, 2009, 5:12am

I’ve been stuck on a CUFFT problem for a while now. It took me ages to track down, but it seems that the forward FFT is occasionally giving me a bad result. I’m not sure whether this is a hardware error, a bug in the CUFFT library, or just my misuse of the library. The code below reproduces the problem, which shows up after running successive FFTs.

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cutil_inline.h>
#include <cufft.h>

int main(void)
{
    int n = 1 << 22;   /* total complex samples */
    int nx = 1 << 12;  /* batch count; each transform has n/nx = 1024 points */
    int g = 1000;      /* number of loops */
    int infd = 0;      /* set once an inf magnitude is seen */
    FILE *fp;
    float pf[4] = {-3, -1, 1, 3};

    float2 *d_v = 0;
    cutilSafeCall(cudaMalloc((void **)&d_v, n*sizeof(float2)));

    float2 *h_v = (float2 *)malloc(n*sizeof(float2));
    float *h_w = (float *)malloc(n*sizeof(float));
    if (h_v == NULL || h_w == NULL)
    {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }

    cufftHandle plan;
    cufftSafeCall(cufftPlan1d(&plan, n/nx, CUFFT_C2C, nx));

    if ((fp = fopen("ffterror", "w")) == NULL)
    {
        fprintf(stderr, "fopen failed\n");
        return 1;
    }

    for (int p = 0; p < g; p++)
    {
        printf("\b\b\b\b\b\b\b\b\b\b");
        printf("Loop: %d", p+1);
        fflush(stdout);

        /* Fill the input with random real values from pf */
        for (int i = 0; i < n; i++)
        {
            h_v[i].x = pf[rand()%4];
            h_v[i].y = 0;
        }

        /* In-place forward FFT on the device */
        cutilSafeCall(cudaMemcpy(d_v, h_v, n*sizeof(float2), cudaMemcpyHostToDevice));
        cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_v, (cufftComplex *)d_v, CUFFT_FORWARD));
        cutilSafeCall(cudaMemcpy(h_v, d_v, n*sizeof(float2), cudaMemcpyDeviceToHost));

        /* Magnitude of every output; flag the loop if any is inf */
        for (int i = 0; i < n; i++)
        {
            h_w[i] = sqrtf(h_v[i].x*h_v[i].x + h_v[i].y*h_v[i].y);
            if (isinf(h_w[i]))
                infd = 1;
        }

        if (infd)
        {
            /* Dump real part, imaginary part and magnitude for this loop */
            for (int i = 0; i < n; i++)
            {
                fprintf(fp, "%f\t%f\t%f\n", h_v[i].x, h_v[i].y, h_w[i]);
            }
            printf("\ninf detected\n");
            break;
        }
    }

    if (!infd)
        printf("\n");

    fclose(fp);
    cufftDestroy(plan);
    cudaFree(d_v);
    free(h_v);
    free(h_w);

    return infd;
}[/codebox]

To test for a bad result I’m taking the absolute value of each complex number. It’s only by luck that some of the wrong numbers are large enough to make the absolute value ‘inf’, which makes errors easy to check for. I’ve never had a problem with the FFT in the first loop (although I can’t say it will always be correct), but I have had a bad result as early as the third loop. The output file contains the complex values and the absolute values of the FFT from the loop in which an ‘inf’ was detected. If I search for that inf I see something like this:

[codebox]…
0.822151 5.821064 5.878837
-14.281403 -35.493328 38.258789
28.915157 74.496826 79.911598
-68.777946 -34.812363 77.086357
8624051719469774914901966097219584.000000 102.752304 inf
-28.906429 -13.132529 31.749723
68.554565 37.719254 78.246216
38.730705 -39.402470 55.250538
-20.291538 28.090515 34.652901
…[/codebox]
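A tighter check than waiting for an ‘inf’ is possible here, because CUFFT transforms are unnormalized: no output of a length-N transform can have a magnitude larger than N times the largest input magnitude, which is 1024 × 3 = 3072 for the test above. The following is only a sketch of such a check on the host side. The names `cpx` and `count_out_of_range` are made up for illustration, `cpx` stands in for `float2` so that it compiles as plain C, and the 0.1% slack for rounding is an assumption rather than a derived figure.

```c
#include <stddef.h>

/* Hypothetical stand-in for CUDA's float2 so the sketch compiles as plain C. */
typedef struct { float x; float y; } cpx;

/*
 * Count outputs whose magnitude is NaN, infinite, or larger than
 * len * max_abs_input (plus a small slack). That product is the largest
 * magnitude an unnormalized DFT of length len can produce.
 * If first_bad is not NULL, it receives the index of the first offender.
 */
size_t count_out_of_range(const cpx *v, size_t count, size_t len,
                          float max_abs_input, size_t *first_bad)
{
    const double slack = 1.001; /* assumed allowance for rounding error */
    const double bound = (double)len * (double)max_abs_input * slack;
    const double bound2 = bound * bound;
    size_t bad = 0;

    for (size_t i = 0; i < count; i++) {
        /* Squared magnitude in double, so large floats do not overflow */
        double re = (double)v[i].x;
        double im = (double)v[i].y;
        double mag2 = re * re + im * im;

        /* Written as !(a <= b) so that NaN is counted as out of range */
        if (!(mag2 <= bound2)) {
            if (bad == 0 && first_bad != NULL)
                *first_bad = i;
            bad++;
        }
    }
    return bad;
}
```

In the test program this would be called once per loop on `h_v` (with `float2` in place of `cpx`), using `len = n/nx` and `max_abs_input = 3.0f`. It would also catch wrong values that are merely too large rather than large enough to overflow.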

I don’t think 8.6241e+33 is a good result. I always seem to get an error like this within a couple of hundred loops, and usually much earlier. I also have a few questions about the CUFFT library that I haven’t been able to find answers to.

I read that there is a maximum transform size of 8 million, is that actually 8000000 or 2^23 = 8388608?

Is there a maximum number of batches for the transform?
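For reference, the sizes used by the test above work out as follows. This is only a sketch of the arithmetic. `fft_layout`, `describe_layout` and `within_limit` are hypothetical names, not part of CUFFT, and the limit passed in is whichever reading of "8 million" is being checked, since the actual figure is the open question.

```c
#include <stddef.h>

/* Hypothetical summary of a batched 1D complex-to-complex transform. */
typedef struct {
    size_t transform_len; /* points per transform */
    size_t batch;         /* number of transforms */
    size_t total_elems;   /* complex elements overall */
    size_t bytes;         /* buffer size for single-precision complex data */
} fft_layout;

/* Mirrors the call cufftPlan1d(&plan, n/nx, CUFFT_C2C, nx) from the test. */
fft_layout describe_layout(size_t n, size_t nx)
{
    fft_layout l;
    l.transform_len = n / nx;
    l.batch = nx;
    l.total_elems = l.transform_len * l.batch;
    l.bytes = l.total_elems * 2 * sizeof(float); /* real + imaginary part */
    return l;
}

/* Returns 1 if the total element count does not exceed the given limit. */
int within_limit(const fft_layout *l, size_t limit)
{
    return l->total_elems <= limit;
}
```

With `n = 1 << 22` and `nx = 1 << 12` this gives 4096 transforms of 1024 points each, 4194304 complex elements in total, and a 33554432-byte buffer. The total is below both 8000000 and 8388608, so the test stays under the limit whichever reading is correct, provided the limit is on element count at all.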

Here’s the output from deviceQuery in case it’s any help:

[codebox]CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "GeForce 9800 GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.50 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…[/codebox]

If anyone can see what I’m doing wrong or offer me any advice, I’d be very grateful. I’m hoping to try this code on another card at some stage to see if the problem still exists.
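One way to narrow this down on the current card might be to run the same input through the same plan twice and compare the two results bit for bit. The sketch below covers only the host-side comparison. It assumes that CUFFT returns identical bits for identical input and plan, which I have not verified, and `cpx` and `count_mismatches` are made-up names, with `cpx` standing in for `float2`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for CUDA's float2 so the sketch compiles as plain C. */
typedef struct { float x; float y; } cpx;

/*
 * Count elements that differ between two result buffers of equal length.
 * If first_diff is not NULL, it receives the index of the first mismatch.
 */
size_t count_mismatches(const cpx *a, const cpx *b, size_t count,
                        size_t *first_diff)
{
    size_t diff = 0;

    for (size_t i = 0; i < count; i++) {
        /* memcmp compares raw bits, so NaN payloads and -0.0 are not hidden */
        if (memcmp(&a[i], &b[i], sizeof(cpx)) != 0) {
            if (diff == 0 && first_diff != NULL)
                *first_diff = i;
            diff++;
        }
    }
    return diff;
}
```

Under that assumption, results that differ between two runs on the same input would point towards a transient fault such as bad memory or overheating, while a bad value that reproduces exactly would point towards the library or the way it is being called.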
