cuFFT and fftw

Hello,

I’m hoping someone can point me in the right direction on what is happening. I have three code samples, one using fftw3, the other two using cufft. My fftw example uses the real2complex functions to perform the fft. My cufft equivalent does not work, but if I manually fill a complex array the complex2complex works. Here are some code samples:

float *ptr is the array holding a 2d image which is my test case of size w, h. I apply a fft along the width.

[codebox]

    cufftHandle plan;

    cufftPlanMany (&plan, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_C2C, h);

cufftComplex *devin;

    cudaMalloc ((void**)&devin, sizeof (cufftComplex)*w*h);

cufftComplex *devout;

    cudaMalloc ((void**)&devout, sizeof (cufftComplex)*w*h);

cufftComplex hostd = new cufftComplex[wh];

    for (int i = 0; i < h; i++)

    {

            for (int j = 0; j < w; j++)

            {

                    hostd[i*w + j].x = ptr[i*w + j];

                    hostd[i*w + j].y = 0.f;

            }

    }

cudaMemcpy (devin, hostd, sizeof (cufftComplex)wh, cudaMemcpyHostToDevice);

printf ("-= Performing CUDA FFT forward =-\n");

    cufftExecC2C (plan, devin, devout, CUFFT_FORWARD);

cudaMemcpy (hostd, devout, sizeof (cufftComplex)wh, cudaMemcpyDeviceToHost);

for (int i = 0; i < h; i++)

    {

            for (int j = 0; j < w; j++)

            {

                    hostd[i*w + j].x *= filter[j];

                    hostd[i*w + j].y *= filter[j];

            }

    }

delete filter;

cudaMemcpy (devin, hostd, sizeof (cufftComplex)wh, cudaMemcpyHostToDevice);

printf ("-= Performing CUDA FFT inverse =-\n");

    cufftExecC2C (plan, devin, devout, CUFFT_INVERSE);

cudaMemcpy (hostd, devout, sizeof (cufftComplex)wh, cudaMemcpyDeviceToHost);

for (int i = 0; i < h; i++)

    {

            for (int j = 0; j < w; j++)

            {

                    ptr[i*w + j] = hostd[i*w + j].x/w;

            }

    }

delete hostd;

cufftDestroy (plan);

    cudaFree (devout);

    cudaFree (devin);

[/codebox]

Same input values, except I create two plans, one for R2C, then C2R. This produces incorrect results.

[codebox]

    cufftHandle plan1, plan2;

    cufftPlanMany (&plan1, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, h);

    cufftPlanMany (&plan2, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_C2R, h);

float *devin;

    cudaMalloc ((void**)&devin, sizeof (float)*w*h);

cufftComplex *devout;

    cudaMalloc ((void**)&devout, sizeof (cufftComplex)*w*h);

cudaMemcpy (devin, ptr, sizeof (float)wh, cudaMemcpyHostToDevice);

printf ("-= Performing CUDA FFT forward =-\n");

    cufftExecR2C (plan1, devin, devout);

cufftComplex hostd = new cufftComplex[wh];

    cudaMemcpy (hostd, devout, sizeof (cufftComplex)*w*h, cudaMemcpyDeviceToHost);

for (int i = 0; i < h; i++)

    {

            for (int j = 0; j < w; j++)

            {

                    hostd[i*w + j].x *= filter[j];

                    hostd[i*w + j].y *= filter[j];

            }

    }

delete filter;

cudaMemcpy (devout, hostd, sizeof (cufftComplex)wh, cudaMemcpyHostToDevice);

printf ("-= Performing CUDA FFT inverse =-\n");

    cufftExecC2R (plan2, devout, devin);

cudaMemcpy (ptr, devin, sizeof (float)wh, cudaMemcpyDeviceToHost);

delete hostd;

cufftDestroy (plan2);

    cufftDestroy (plan1);

    cudaFree (devout);

    cudaFree (devin);

[/codebox]

I can provide the fftw equivalent if its relevant. The first version, C2C, works in producing the same look, but normalizes the values (which I think is caused by the divide by width when copying back to ptr). The fftw version does not perform this normalization. The second cufft version, R2C and C2R, does not work and it returns the image, unchanged as far as i can tell. The filter being applied should greatly change the way the image looks. Thanks for any assistance!

-brad

-edit Corrected memcpy so it shows copy from host to device after applying the filter correctly

It would be helpful to know which, if any, of the CUDA Runtime or CUFFT API calls are returning error codes. Also, which version of the CUDA Toolkit (including CUFFT) are you using?

Thanks,
Cliff

It would be helpful to know which, if any, of the CUDA Runtime or CUFFT API calls are returning error codes. Also, which version of the CUDA Toolkit (including CUFFT) are you using?

Thanks,
Cliff

Looks like your memcpy back to the gpu is copying to wrong array. Also, as an aside, I’m assuming that you copy the arrays back to the cpu to apply the filter for clarity. For performance, you should most definitely be doing everything on the gpu.

Looks like your memcpy back to the gpu is copying to wrong array. Also, as an aside, I’m assuming that you copy the arrays back to the cpu to apply the filter for clarity. For performance, you should most definitely be doing everything on the gpu.

Ah, yes, I agree. Good catch! Sorry I didn’t look closely enough at the code to spot those things.

PS: Even after fixing these issues, the code should still check for API errors. :)

–Cliff

Ah, yes, I agree. Good catch! Sorry I didn’t look closely enough at the code to spot those things.

PS: Even after fixing these issues, the code should still check for API errors. :)

–Cliff

Thanks for the replies!

I’ve been checking for errors, but for clarity I omitted them in my post. The only errors I have received have been numerous “Microsoft C++ exception: cudaError_enum at memory location 0x0012fb00…”, which only occurs on the first cuda call access.

I’m using the 3.0 toolkit on redhat linux 5.3, and on my vista laptop I was running 3.0 but upgraded today to the latest 3.1 toolkit and drivers.

Yes, I had planned to apply the filter in a kernel once I find out what I’m missing.

Thanks for the replies!

I’ve been checking for errors, but for clarity I omitted them in my post. The only errors I have received have been numerous “Microsoft C++ exception: cudaError_enum at memory location 0x0012fb00…”, which only occurs on the first cuda call access.

I’m using the 3.0 toolkit on redhat linux 5.3, and on my vista laptop I was running 3.0 but upgraded today to the latest 3.1 toolkit and drivers.

Yes, I had planned to apply the filter in a kernel once I find out what I’m missing.

I’ve added a kernel to perform the filtering, and since I made a large number of changes I will repost my code. I am using the same kernel in both the C2C and R2C->C2R versions.

[codebox]

    cufftHandle plan1, plan2;

CUFFT_CHECK (cufftPlanMany (&plan1, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, h));

CUFFT_CHECK (cufftPlanMany (&plan2, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_C2R, h));

float *devR, *devF;

CUDA_CHECK (cudaMalloc ((void**)&devR, sizeof (float)*w*h));

CUDA_CHECK (cudaMalloc ((void**)&devF, sizeof (float)*(w + 1)));

cufftComplex *devC;

CUDA_CHECK (cudaMalloc ((void**)&devC, sizeof (cufftComplex)*(w*h + 1)));

CUDA_CHECK (cudaMemcpy (devF, filter, sizeof (float)*(w + 1), cudaMemcpyHostToDevice));

CUDA_CHECK (cudaMemcpy (devR, ptr, sizeof (float)*w*h, cudaMemcpyHostToDevice));

printf ("-= Performing CUDA FFT forward =-\n");

CUFFT_CHECK (cufftExecR2C (plan1, devR, devC));

int block = 8;

runFilter (devC, devF, block, block, w, h);

CUDA_CHECK_ERROR ();

printf ("-= Performing CUDA FFT inverse =-\n");

CUFFT_CHECK (cufftExecC2R (plan2, devC, devR));

CUDA_CHECK (cudaMemcpy (ptr, devR, sizeof (float)*w*h, cudaMemcpyDeviceToHost));

CUFFT_CHECK (cufftDestroy (plan2));

CUFFT_CHECK (cufftDestroy (plan1));

CUDA_CHECK (cudaFree (devC));

CUDA_CHECK (cudaFree (devR));

CUDA_CHECK (cudaFree (devF));

[/codebox]

I am still getting odd results. It seems that using R2C and C2R is skipping lines. Here are 3 images to show the results I’m getting:

Original image: http://img827.imageshack.us/img827/4576/gr…024x1024000.png

C2C plan: http://img651.imageshack.us/img651/9572/gr…4x1024000ra.png

R2C C2R plan: http://img638.imageshack.us/img638/9572/gr…4x1024000ra.png

As the R2C plan shows, the image is performing some function on every other row. Every other line doesn’t look like its actually performing anything, in fact it is the same color. I don’t receive any API errors during execution. Maybe I’m not creating a large enough buffer for cufft?

I tried using the compatibility settings with cufftSetCompatibilityMode. Do I need to have a fermi card to run these options? ‘Native’ and ‘fftw-padding’ work, but ‘asymmetric’ and ‘all’ give me invalid plan handle errors.

Thanks again!

-brad

I’ve added a kernel to perform the filtering, and since I made a large number of changes I will repost my code. I am using the same kernel in both the C2C and R2C->C2R versions.

[codebox]

    cufftHandle plan1, plan2;

CUFFT_CHECK (cufftPlanMany (&plan1, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, h));

CUFFT_CHECK (cufftPlanMany (&plan2, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_C2R, h));

float *devR, *devF;

CUDA_CHECK (cudaMalloc ((void**)&devR, sizeof (float)*w*h));

CUDA_CHECK (cudaMalloc ((void**)&devF, sizeof (float)*(w + 1)));

cufftComplex *devC;

CUDA_CHECK (cudaMalloc ((void**)&devC, sizeof (cufftComplex)*(w*h + 1)));

CUDA_CHECK (cudaMemcpy (devF, filter, sizeof (float)*(w + 1), cudaMemcpyHostToDevice));

CUDA_CHECK (cudaMemcpy (devR, ptr, sizeof (float)*w*h, cudaMemcpyHostToDevice));

printf ("-= Performing CUDA FFT forward =-\n");

CUFFT_CHECK (cufftExecR2C (plan1, devR, devC));

int block = 8;

runFilter (devC, devF, block, block, w, h);

CUDA_CHECK_ERROR ();

printf ("-= Performing CUDA FFT inverse =-\n");

CUFFT_CHECK (cufftExecC2R (plan2, devC, devR));

CUDA_CHECK (cudaMemcpy (ptr, devR, sizeof (float)*w*h, cudaMemcpyDeviceToHost));

CUFFT_CHECK (cufftDestroy (plan2));

CUFFT_CHECK (cufftDestroy (plan1));

CUDA_CHECK (cudaFree (devC));

CUDA_CHECK (cudaFree (devR));

CUDA_CHECK (cudaFree (devF));

[/codebox]

I am still getting odd results. It seems that using R2C and C2R is skipping lines. Here are 3 images to show the results I’m getting:

Original image: http://img827.imageshack.us/img827/4576/gr…024x1024000.png

C2C plan: http://img651.imageshack.us/img651/9572/gr…4x1024000ra.png

R2C C2R plan: http://img638.imageshack.us/img638/9572/gr…4x1024000ra.png

As the R2C plan shows, the image is performing some function on every other row. Every other line doesn’t look like its actually performing anything, in fact it is the same color. I don’t receive any API errors during execution. Maybe I’m not creating a large enough buffer for cufft?

I tried using the compatibility settings with cufftSetCompatibilityMode. Do I need to have a fermi card to run these options? ‘Native’ and ‘fftw-padding’ work, but ‘asymmetric’ and ‘all’ give me invalid plan handle errors.

Thanks again!

-brad