cuFFT and fftw

galapaegos · August 24, 2010, 9:13pm

Hello,

I’m hoping someone can point me in the right direction on what is happening. I have three code samples, one using fftw3, the other two using cufft. My fftw example uses the real2complex functions to perform the fft. My cufft equivalent does not work, but if I manually fill a complex array the complex2complex works. Here are some code samples:

float *ptr is the array holding a 2D image of size w × h, which is my test case. I apply an FFT along the width (one 1D transform per row).

[codebox]
    cufftHandle plan;
    cufftPlanMany (&plan, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_C2C, h);

    cufftComplex *devin;
    cudaMalloc ((void**)&devin, sizeof (cufftComplex)*w*h);

    cufftComplex *devout;
    cudaMalloc ((void**)&devout, sizeof (cufftComplex)*w*h);

    // Pack the real image into a complex array with zero imaginary parts.
    cufftComplex *hostd = new cufftComplex[w*h];
    for (int i = 0; i < h; i++)
    {
        for (int j = 0; j < w; j++)
        {
            hostd[i*w + j].x = ptr[i*w + j];
            hostd[i*w + j].y = 0.f;
        }
    }

    cudaMemcpy (devin, hostd, sizeof (cufftComplex)*w*h, cudaMemcpyHostToDevice);

    printf ("-= Performing CUDA FFT forward =-\n");
    cufftExecC2C (plan, devin, devout, CUFFT_FORWARD);

    cudaMemcpy (hostd, devout, sizeof (cufftComplex)*w*h, cudaMemcpyDeviceToHost);

    // Apply the filter on the host, one gain per frequency bin.
    for (int i = 0; i < h; i++)
    {
        for (int j = 0; j < w; j++)
        {
            hostd[i*w + j].x *= filter[j];
            hostd[i*w + j].y *= filter[j];
        }
    }
    delete[] filter;

    cudaMemcpy (devin, hostd, sizeof (cufftComplex)*w*h, cudaMemcpyHostToDevice);

    printf ("-= Performing CUDA FFT inverse =-\n");
    cufftExecC2C (plan, devin, devout, CUFFT_INVERSE);

    cudaMemcpy (hostd, devout, sizeof (cufftComplex)*w*h, cudaMemcpyDeviceToHost);

    // Keep the real part and divide by the transform length.
    for (int i = 0; i < h; i++)
    {
        for (int j = 0; j < w; j++)
        {
            ptr[i*w + j] = hostd[i*w + j].x/w;
        }
    }
    delete[] hostd;

    cufftDestroy (plan);
    cudaFree (devout);
    cudaFree (devin);
[/codebox]

Same input values, except I create two plans, one for R2C, then C2R. This produces incorrect results.

[codebox]
    cufftHandle plan1, plan2;
    cufftPlanMany (&plan1, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, h);
    cufftPlanMany (&plan2, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_C2R, h);

    float *devin;
    cudaMalloc ((void**)&devin, sizeof (float)*w*h);

    cufftComplex *devout;
    cudaMalloc ((void**)&devout, sizeof (cufftComplex)*w*h);

    cudaMemcpy (devin, ptr, sizeof (float)*w*h, cudaMemcpyHostToDevice);

    printf ("-= Performing CUDA FFT forward =-\n");
    cufftExecR2C (plan1, devin, devout);

    cufftComplex *hostd = new cufftComplex[w*h];
    cudaMemcpy (hostd, devout, sizeof (cufftComplex)*w*h, cudaMemcpyDeviceToHost);

    // Apply the filter on the host.
    for (int i = 0; i < h; i++)
    {
        for (int j = 0; j < w; j++)
        {
            hostd[i*w + j].x *= filter[j];
            hostd[i*w + j].y *= filter[j];
        }
    }
    delete[] filter;

    cudaMemcpy (devout, hostd, sizeof (cufftComplex)*w*h, cudaMemcpyHostToDevice);

    printf ("-= Performing CUDA FFT inverse =-\n");
    cufftExecC2R (plan2, devout, devin);

    cudaMemcpy (ptr, devin, sizeof (float)*w*h, cudaMemcpyDeviceToHost);

    delete[] hostd;

    cufftDestroy (plan2);
    cufftDestroy (plan1);
    cudaFree (devout);
    cudaFree (devin);
[/codebox]

I can provide the fftw equivalent if it’s relevant. The first version, C2C, produces the same look as fftw, but with normalized values (which I think is caused by the divide by width when copying back to ptr). The fftw version does not perform this normalization. The second cufft version, R2C and C2R, does not work: it returns the image unchanged as far as I can tell, even though the filter being applied should greatly change the way the image looks. Thanks for any assistance!

-brad

-edit: Corrected the memcpy so that, after the filter is applied, the data is copied from host to device correctly.
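
On the normalization point: cuFFT computes unnormalized transforms, and the FFTW manual describes the same convention, so a forward transform followed by an inverse returns the input multiplied by the transform length. Dividing by w, as the C2C sample does, undoes that factor. The sketch below illustrates the scaling with a naive DFT standing in for either library; `dft` and `round_trip_unscaled` are hypothetical helper names, not part of cuFFT or FFTW.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// Naive O(n^2) unnormalized DFT of one row (a stand-in for the library call).
// sign = -1 is the forward transform, sign = +1 the inverse.
// Neither direction divides by n.
std::vector<cplx> dft(const std::vector<cplx>& in, int sign) {
    assert(sign == 1 || sign == -1);
    const std::size_t n = in.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * two_pi * double((k * j) % n) / double(n);
            acc += std::complex<double>(in[j]) * std::polar(1.0, angle);
        }
        out[k] = cplx(acc);
    }
    return out;
}

// Forward then inverse with no scaling: the result is n times the input.
std::vector<float> round_trip_unscaled(const std::vector<float>& row) {
    std::vector<cplx> c(row.begin(), row.end());
    const std::vector<cplx> back = dft(dft(c, -1), +1);
    std::vector<float> out(row.size());
    for (std::size_t i = 0; i < row.size(); ++i) out[i] = back[i].real();
    return out;
}
```

For a row of length 4 such as {1, 2, 3, 4}, `round_trip_unscaled` returns values close to {4, 8, 12, 16}. If two pipelines differ only by a constant factor, a difference in where (or whether) the division by w is applied is one thing worth checking.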

Cliff_Woolley · August 24, 2010, 9:21pm

It would be helpful to know which, if any, of the CUDA Runtime or CUFFT API calls are returning error codes. Also, which version of the CUDA Toolkit (including CUFFT) are you using?

Thanks,
Cliff

eelsen · August 24, 2010, 9:23pm

Looks like your memcpy back to the GPU is copying to the wrong array: after filtering, the R2C/C2R version copies hostd into devin, but cufftExecC2R reads from devout. Also, as an aside, I’m assuming that you copy the arrays back to the CPU to apply the filter only for clarity. For performance, you should definitely be doing everything on the GPU.

Cliff_Woolley · August 24, 2010, 9:27pm

Ah, yes, I agree. Good catch! Sorry I didn’t look closely enough at the code to spot those things.

PS: Even after fixing these issues, the code should still check for API errors. :)

–Cliff
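
A minimal sketch of the error-checking pattern being recommended here. To stay compilable without the CUDA toolkit it uses a stand-in status enum; `DemoStatus`, `DEMO_CHECK`, `check_status` and `demo_call` are hypothetical names for illustration. In real code the same wrapper would compare a `cudaError_t` against `cudaSuccess`, or a `cufftResult` against `CUFFT_SUCCESS`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for a library status code (hypothetical, for illustration only).
enum DemoStatus { DEMO_SUCCESS = 0, DEMO_FAILURE = 1 };

// Throws with the failing expression, status code, file and line
// whenever a call does not return the success value.
inline void check_status(int status, int success, const char* expr,
                         const char* file, int line) {
    assert(expr != nullptr && file != nullptr);
    if (status != success) {
        throw std::runtime_error(std::string(expr) + " failed with code " +
                                 std::to_string(status) + " at " + file + ":" +
                                 std::to_string(line));
    }
}

// Wrap every API call so that a failure cannot pass unnoticed.
#define DEMO_CHECK(call) \
    check_status((call), DEMO_SUCCESS, #call, __FILE__, __LINE__)

// Hypothetical API call that can succeed or fail on demand.
inline DemoStatus demo_call(bool ok) { return ok ? DEMO_SUCCESS : DEMO_FAILURE; }
```

Wrapping each call, for example `DEMO_CHECK(demo_call(true));`, reports the first failing call instead of letting later calls run on bad state.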

galapaegos · August 25, 2010, 3:29am

Thanks for the replies!

I’ve been checking for errors, but for clarity I omitted them in my post. The only errors I have received have been numerous “Microsoft C++ exception: cudaError_enum at memory location 0x0012fb00…”, which only occurs on the first cuda call access.

I’m using the 3.0 toolkit on Red Hat Linux 5.3. On my Vista laptop I was running 3.0, but I upgraded today to the latest 3.1 toolkit and drivers.

Yes, I had planned to apply the filter in a kernel once I find out what I’m missing.

galapaegos · August 25, 2010, 9:56pm

I’ve added a kernel to perform the filtering, and since I made a large number of changes I will repost my code. I am using the same kernel in both the C2C and R2C->C2R versions.

[codebox]
    cufftHandle plan1, plan2;
    CUFFT_CHECK (cufftPlanMany (&plan1, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_R2C, h));
    CUFFT_CHECK (cufftPlanMany (&plan2, 1, &w, NULL, 1, 0, NULL, 1, 0, CUFFT_C2R, h));

    float *devR, *devF;
    CUDA_CHECK (cudaMalloc ((void**)&devR, sizeof (float)*w*h));
    CUDA_CHECK (cudaMalloc ((void**)&devF, sizeof (float)*(w + 1)));

    cufftComplex *devC;
    CUDA_CHECK (cudaMalloc ((void**)&devC, sizeof (cufftComplex)*(w*h + 1)));

    CUDA_CHECK (cudaMemcpy (devF, filter, sizeof (float)*(w + 1), cudaMemcpyHostToDevice));
    CUDA_CHECK (cudaMemcpy (devR, ptr, sizeof (float)*w*h, cudaMemcpyHostToDevice));

    printf ("-= Performing CUDA FFT forward =-\n");
    CUFFT_CHECK (cufftExecR2C (plan1, devR, devC));

    int block = 8;
    runFilter (devC, devF, block, block, w, h);
    CUDA_CHECK_ERROR ();

    printf ("-= Performing CUDA FFT inverse =-\n");
    CUFFT_CHECK (cufftExecC2R (plan2, devC, devR));

    CUDA_CHECK (cudaMemcpy (ptr, devR, sizeof (float)*w*h, cudaMemcpyDeviceToHost));

    CUFFT_CHECK (cufftDestroy (plan2));
    CUFFT_CHECK (cufftDestroy (plan1));

    CUDA_CHECK (cudaFree (devC));
    CUDA_CHECK (cudaFree (devR));
    CUDA_CHECK (cudaFree (devF));
[/codebox]

I am still getting odd results. It seems that using R2C and C2R is skipping lines. Here are 3 images to show the results I’m getting:

Original image: http://img827.imageshack.us/img827/4576/gr…024x1024000.png

C2C plan: http://img651.imageshack.us/img651/9572/gr…4x1024000ra.png

R2C C2R plan: http://img638.imageshack.us/img638/9572/gr…4x1024000ra.png

As the R2C result shows, the filter appears to be applied only on every other row. The rows in between don’t look like anything is being done to them; in fact, they are all the same color. I don’t receive any API errors during execution. Maybe I’m not creating a large enough buffer for cufft?

I tried using the compatibility settings with cufftSetCompatibilityMode. Do I need to have a Fermi card to run these options? ‘Native’ and ‘fftw-padding’ work, but ‘asymmetric’ and ‘all’ give me invalid plan handle errors.

Thanks again!

-brad
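
One thing worth checking here, offered as a possibility rather than a confirmed diagnosis since the `runFilter` kernel is not shown: a real-to-complex transform of length w stores only w/2 + 1 complex values per row, because the remaining bins are complex conjugates of the stored ones. With the default packed layout (NULL `inembed`/`onembed` in `cufftPlanMany`), row i of the R2C output therefore starts at offset i*(w/2 + 1), not i*w. The sketch below uses a naive DFT as a stand-in for cuFFT to show that layout; `half_width`, `r2c_index`, `rows_r2c` and `apply_filter` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// Number of complex bins a real-to-complex transform of length w keeps per row.
std::size_t half_width(std::size_t w) { return w / 2 + 1; }

// Offset of frequency bin k of a given row in a packed R2C output buffer.
std::size_t r2c_index(std::size_t row, std::size_t k, std::size_t w) {
    assert(k < half_width(w));
    return row * half_width(w) + k;
}

// Naive forward real DFT of every row, keeping only bins 0..w/2,
// stored row after row with stride w/2 + 1 (stand-in for an R2C transform).
std::vector<cplx> rows_r2c(const std::vector<float>& img,
                           std::size_t w, std::size_t h) {
    assert(img.size() == w * h);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(h * half_width(w));
    for (std::size_t row = 0; row < h; ++row) {
        for (std::size_t k = 0; k < half_width(w); ++k) {
            std::complex<double> acc(0.0, 0.0);
            for (std::size_t j = 0; j < w; ++j) {
                const double angle = -two_pi * double((k * j) % w) / double(w);
                acc += double(img[row * w + j]) * std::polar(1.0, angle);
            }
            out[r2c_index(row, k, w)] = cplx(acc);
        }
    }
    return out;
}

// Multiply each stored bin by a per-frequency gain, filter[k] for k in 0..w/2.
void apply_filter(std::vector<cplx>& spec, const std::vector<float>& filter,
                  std::size_t w, std::size_t h) {
    assert(filter.size() >= half_width(w));
    assert(spec.size() == h * half_width(w));
    for (std::size_t row = 0; row < h; ++row)
        for (std::size_t k = 0; k < half_width(w); ++k)
            spec[r2c_index(row, k, w)] *= filter[k];
}
```

With w = 8, each row holds 5 complex values, so an index computed as i*w + j with i = 1, j = 0 (offset 8) actually lands on bin 3 of row 1 rather than bin 0. For large w, a stride of w advances roughly two packed rows at a time, which would be consistent with a filter that seems to touch only every other row.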
