CUFFT bug in Cuda 4.0 Release Candidate 2

I’m using Linux, CentOS versus 5.4, host compiler is gcc v 4.1.2.

With Cuda 4.0 RC 2, FFT’s seem to be broken for “in place” transforms. I’m using batched 1D transforms R2C and C2R. The problem does look like it’s part of the batch. I’m transforming some images, doing 1D transforms on each row. When just running forwards then backwards transforms with nothing in between, a strange diagonal line appears in the image. It screws with my computations that I’ve been doing since CUDA 3.0

It should be simple to verify this bug by doing a forward then backwards transform, and getting much different results.

Simply doing “out of place” transforms fixed the issues I’m having, but you guys may want to fix this.

if(cufftPlan1d(&plan1, WIDTHPOW2, CUFFT_R2C,HEIGHT) != CUFFT_SUCCESS) {

		printf("CUFFT ERROR PLAN 1\n");

		exit(1);

	}

	if(cufftPlan1d(&plan2, WIDTHPOW2, CUFFT_C2R,HEIGHT) != CUFFT_SUCCESS) {

		printf("CUFFT ERROR PLAN 2\n");

		exit(1);

	}

if(cufftExecR2C(plan1, c_image,

(cufftComplex *)c_image) !=

		 CUFFT_SUCCESS) {

		printf("CUFFT FAILURE exec 1!\n");

		exit(1);

	}

	cudaThreadSynchronize();

if(cufftExecC2R(plan2,

(cufftComplex *)c_image,c_image) != CUFFT_SUCCESS) {

		printf("CUFFT FAILURE exec 1!\n");

		exit(1);

	}

		cudaThreadSynchronize();

Can you please file a bug, attaching a self-contained repro case? Since you indicate that this is a regression from earlier versions of CUDA, it would be helpful if you could also attach the output from running the repro app with CUDA 3.2 vs running with CUDA 4.0. Thank you for your help!

Where do you file a bug report?

I could maybe make a complete program that reproduces the error… but I’m not going to move between cuda 3.2 and cuda 4.0 RC2 just to try to illustrate different output.

Your staff should just be able to fill up a floating array with numbers, apply the in place transforms forwards and backwards, and see that the output and inputs don’t match like they should.

You need to be a registered developer to file a bug. If you are not, you can apply at http://developer.nvidia.com/join-nvidia-registered-developer-program

Going back to the bug, the error is probably in your memory allocation.

This example does an inplace transform and it gives the correct results, both on 3.2 and 4.0 RC2

$ cat rfft.cu 

#include "cufft.h"

int main()

{

 cufftReal *input, *input_d;

 cufftReal *inplace, *inplace_d;

 cufftComplex *outofplace_d;

 cufftHandle r2cplan,c2rplan;

 int i,N=8;

// Allocate memory

 input=(cufftReal*) malloc(sizeof(cufftReal)*N);

 inplace=(cufftReal*) malloc(sizeof(cufftReal)*(N+2));

cudaMalloc( (void**)& input_d, sizeof(cufftReal)*N); 

 cudaMalloc( (void**)& inplace_d, sizeof(cufftReal)*(N+2)); 

 cudaMalloc( (void**)& outofplace_d, sizeof(cufftComplex)*(N/2+1)*2);

// Out of place transform

 for (i=0;i<N;i++) input[i]=(float)i;

printf (" Initial signal\n");

 for (i=0;i<N;i++)  printf("%d %f\n",i,input[i]);

cudaMemcpy(input_d,input,sizeof(cufftReal)*N,cudaMemcpyHostToDevice);

for (i=0;i<N;i++) input[i]=0.;

cufftPlan1d(&r2cplan,N,CUFFT_R2C,1);

 cufftPlan1d(&c2rplan,N,CUFFT_C2R,1);

 cufftExecR2C(r2cplan,input_d,outofplace_d);

 cufftExecC2R(c2rplan,outofplace_d,input_d);

 cudaMemcpy(input,input_d,sizeof(cufftReal)*N,cudaMemcpyDeviceToHost);

printf (" Out of place transform \n");

 printf (" Signal      Normalized signal\n");

 for (i=0;i<N;i++)  printf("%d %f %f\n",i,input[i], input[i]/N);

// Inplace transform

 for (i=0;i<N;i++) inplace[i]=(float)i;

cudaMemcpy(inplace_d,inplace,sizeof(cufftReal)*N,cudaMemcpyHostToDevice);

 for (i=0;i<N;i++) inplace[i]=0.;

cufftExecR2C(r2cplan,inplace_d,(cufftComplex*)inplace_d);

 cufftExecC2R(c2rplan,(cufftComplex*)inplace_d,inplace_d);

cudaMemcpy(inplace,inplace_d,sizeof(cufftReal)*N,cudaMemcpyDeviceToHost);

printf (" In place transform \n");

 printf (" Signal      Normalized signal\n");

 for (i=0;i<N;i++)  printf("%d %f %f\n",i,inplace[i], inplace[i]/N);

cufftDestroy(r2cplan);

 cufftDestroy(c2rplan);

free(input);

 free(inplace);

 cudaFree(input_d);

 cudaFree(inplace_d);

 cudaFree(outofplace_d);

}

If you compile the code with

nvcc rfft.cu -L/usr/local/cuda/lib64 -lcufft

The output is the correct one:

Initial signal

0 0.000000

1 1.000000

2 2.000000

3 3.000000

4 4.000000

5 5.000000

6 6.000000

7 7.000000

 Out of place transform 

 Signal      Normalized signal

0 0.000000 0.000000

1 8.000000 1.000000

2 16.000000 2.000000

3 24.000000 3.000000

4 32.000000 4.000000

5 40.000000 5.000000

6 48.000000 6.000000

7 56.000000 7.000000

 In place transform 

 Signal      Normalized signal

0 0.000000 0.000000

1 8.000000 1.000000

2 16.000000 2.000000

3 24.000000 3.000000

4 32.000000 4.000000

5 40.000000 5.000000

6 48.000000 6.000000

7 56.000000 7.000000

Here, I was able to rewrite your code to replicate the problem I was having.

I’m using batched 1D transforms… You can set N and BATCH to whatever you want. I really don’t recall any such problem as the one you’ll see with this code with CUDA 3.2.

You can see in place and out of place giving different results… and in place are setting some numbers to 0 when it really shouldn’t. The problem seems to be with BATCHED in place transforms. The program is changed to only printf on “bad output”.

#include "cufft.h"

int main()

{

 cufftReal *input, *input_d;

 cufftReal *inplace, *inplace_d;

 cufftComplex *outofplace_d;

 cufftHandle r2cplan,c2rplan;

 int i,N=8,BATCH=50;

// Allocate memory

 input=(cufftReal*) malloc(sizeof(cufftReal)*N*BATCH);

 inplace=(cufftReal*) malloc(sizeof(cufftReal)*(N+2)*BATCH);

cudaMalloc( (void**)& input_d, sizeof(cufftReal)*N*BATCH); 

 cudaMalloc( (void**)& inplace_d, sizeof(cufftReal)*(N+2)*BATCH); 

 cudaMalloc( (void**)& outofplace_d, sizeof(cufftComplex)*(N/2+1)*2*BATCH);

// Out of place transform

 for (i=0;i<N*BATCH;i++) input[i]=(float)i;

printf (" Initial signal\n");

 //for (i=0;i<N*BATCH;i++)  printf("%d %f\n",i,input[i]);

cudaMemcpy(input_d,input,sizeof(cufftReal)*N*BATCH,cudaMemcpyHostToDevice);

for (i=0;i<N*BATCH;i++) input[i]=0.;

cufftPlan1d(&r2cplan,N,CUFFT_R2C,BATCH);

 cufftPlan1d(&c2rplan,N,CUFFT_C2R,BATCH);

 cufftExecR2C(r2cplan,input_d,outofplace_d);

 cufftExecC2R(c2rplan,outofplace_d,input_d);

 cudaMemcpy(input,input_d,sizeof(cufftReal)*N*BATCH,cudaMemcpyDeviceToHost);

printf (" Out of place transform \n");

 printf (" Signal      Normalized signal\n");

 for (i=0;i<N*BATCH;i++)  {

        if((fabs((float) input[i]/N-(float)i)) > 1.0) {

                printf("BAD VALUE %d %f %f\n",i,input[i], input[i]/N);

        }

 }

// Inplace transform

 for (i=0;i<N*BATCH;i++) inplace[i]=(float)i;

cudaMemcpy(inplace_d,inplace,sizeof(cufftReal)*N*BATCH,cudaMemcpyHostToDevice);

for (i=0;i<N*BATCH;i++) inplace[i]=0.;

cufftExecR2C(r2cplan,inplace_d,(cufftComplex*)inplace_d);

 cufftExecC2R(c2rplan,(cufftComplex*)inplace_d,inplace_d);

cudaMemcpy(inplace,inplace_d,sizeof(cufftReal)*N*BATCH,cudaMemcpyDeviceToHost);

printf (" In place transform \n");

 printf (" Signal      Normalized signal\n");

 for (i=0;i<N*BATCH;i++)  {

        if((fabs((float) inplace[i]/N-(float)i)) > 1.0) {

                printf("BAD VALUE: %d %f %f\n",i,inplace[i], inplace[i]/N);

        }

 }

 cufftDestroy(r2cplan);

 cufftDestroy(c2rplan);

free(input);

 free(inplace);

 cudaFree(input_d);

 cudaFree(inplace_d);

 cudaFree(outofplace_d);

}

Your memcopies are incorrect.

One array is (N,BATCH), the other one is (N+2,BATCH) so you cannot use a linear cudamemcpy but you need to use a cudamemcpy2d.

This is a working code, just for the inplace.

#include "cufft.h"

int main()

{

 cufftReal *input;

 cufftReal *inplace_d;

 cufftHandle r2cplan,c2rplan;

 int i,N=8,BATCH=50;

// Allocate memory

 input=(cufftReal*) malloc(sizeof(cufftReal)*N*BATCH);

 cudaMalloc( (void**)& inplace_d, sizeof(cufftReal)*(N+2)*BATCH);

// Create plans

 cufftPlan1d(&r2cplan,N,CUFFT_R2C,BATCH);

 cufftPlan1d(&c2rplan,N,CUFFT_C2R,BATCH);

// Inplace transform

 for (i=0;i<N*BATCH;i++) input[i]=(float)i;

// Copy CPU array input(N,BATCH) to inplace(N+2,BATCH) on GPU

 cudaMemcpy2D(inplace_d,sizeof(cufftReal)*(N+2),input,sizeof(cufftReal)*(N),sizeof(cufftReal)*N,BATCH,cudaMemcpyHostToDevice);

for (i=0;i<N*BATCH;i++) input[i]=0.;

cufftExecR2C(r2cplan,inplace_d,(cufftComplex*)inplace_d);

 cufftExecC2R(c2rplan,(cufftComplex*)inplace_d,inplace_d);

// Copy GPU array inplace(N+2,BATCH) back to input(N,BATCH)

 cudaMemcpy2D(input,sizeof(cufftReal)*(N),inplace_d,sizeof(cufftReal)*(N+2),sizeof(cufftReal)*N,BATCH,cudaMemcpyDeviceToHost);

printf (" In place transform \n");

 printf (" Signal      Normalized signal\n");

 for (i=0;i<N*BATCH;i++)  {

        if((fabs((float) input[i]/N-(float)i)) > 1.0) {

                printf("BAD VALUE: %d %f %f\n",i,input[i], input[i]/N);

        }

 }

 cufftDestroy(r2cplan);

 cufftDestroy(c2rplan);

free(input);

 cudaFree(inplace_d);

}

Well, does that last code segment work on CUDA 3.2? Because I’d think it wouldn’t. However, I’m not going to make the attempt… I’m going to leave my 4.0 in for now.

My code works on 3.2 and 4.0, your version was incorrect on both.

I see. Thank you for your help on the issue. Since we’ve put a good amount of money into Tesla’s, I appreciate your feedback and help with my misunderstanding.

You may want to add something like the above code to the examples section of the CUFFT library… you know, since the code’s there now. I see it’s in the CUFFT documentation (Feb 11 edition) on page 8, but it seems a little tricky.