cufftPlan2d fails

I’m trying to compute the FFT of a large 2D image (4096x4096).

When I create my plan:

CUFFT_SAFE_CALL( cufftPlan2d( &plan, rows, cols, CUFFT_C2C ) );

it fails with:

cufft: ERROR:, line 228


It works fine with images up to 2048 squared.

Any hints?

Which card are you using?
Is this the only allocation you are doing on the device?


Using an 8800GTS (320MB) with the 0.9 SDK. Yes, this is the only allocation on the device, and there is no other use of the card going on (except my plain KDE Linux desktop).



  int main( int argc, char **argv )
  {
    int rows = 4096;
    int cols = 4096;

    unsigned int imgSize = cols * rows;
    unsigned int memSize = ( cols * rows ) * sizeof( Complex );

    // Allocate host memory for the signal (my image)
    Complex *pImage = ( Complex * ) malloc( memSize );

    // Initialize the image to something
    for ( unsigned int i = 0; i < imgSize; ++i ) {
      pImage[ i ].x = rand() / ( float )RAND_MAX;
      pImage[ i ].y = 0;
    }

    // Allocate device memory for signal
    Complex *d_pImage;
    CUDA_SAFE_CALL( cudaMalloc( ( void ** ) &d_pImage, memSize ) );

    // Copy host memory to device
    CUDA_SAFE_CALL( cudaMemcpy( d_pImage, pImage, memSize, cudaMemcpyHostToDevice ) );

    // CUFFT plan
    cufftHandle plan;

    // %%%%%%% THIS FAILS FOR 4096x4096 %%%%%%%%
    CUFFT_SAFE_CALL( cufftPlan2d( &plan, rows, cols, CUFFT_C2C ) );
    // %%%%%%%%%%%%%%%%%%%%%

    // ...
  }
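For scale, it helps to work out what a single C2C buffer of that size costs. The helper below is my own sketch (not from the post above): one 4096x4096 complex-float buffer is already 128 MB, nearly half of a 320 MB card, and the plan's internal scratch (which later measurements in this thread put at several times the buffer size) then has no room, while 2048x2048 at 32 MB leaves plenty.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one n x n single-precision complex (C2C) buffer.
   cufftComplex is two floats, i.e. 8 bytes per element. */
static size_t c2c_buffer_bytes(size_t n) {
    return n * n * 2 * sizeof(float);
}
```

So the cudaMalloc of the image succeeds, but the additional memory the planner wants does not fit.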


Will it succeed on an 8800GTX 768MB?

I seriously need some guidance here. Mark? Simon?

I’m using SDK 1.0 now with an 8800GTX 768MB.

I slightly modified the 2D FFT convolution example convolutionFFT2D to do:

7x7 kernel

4088x4088 image (will be auto-padded to 4096)

I inserted the following code before and after the cufftPlan2d on line 242:

 unsigned int theFree, theTotal;

 cuMemGetInfo( &theFree, &theTotal );

 printf( "CARD returns:  free:%u  total:%u\n", theFree, theTotal );

After compiling and running, I get:


CARD returns:  free:321585152  total:804585472

Creating FFT plan for 4096 x 4096...

cufft: ERROR:, line 239


CARD returns:  free:321585152  total:804585472


As you can see, the cudaMalloc calls don’t fail, just the call to create the plan.

Any ideas? Can anybody reproduce this?

I am running into the same sort of problem. I am using an 8800GTX 768MB with CUDA 1.0 and CUFFT 1.0 on WinXP 32-bit. I allocate 6 buffers in global memory for a 3584x3584 image and plan a 2D FFT and inverse FFT of real-to-complex data, out-of-place.

If I don’t use CUFFT, I can allocate 350MB more global memory than when I do use CUFFT. That is 7 more buffers holding the complete image!

Since this memory usage has serious implications for my application, I would like some insight into:

  1. how much memory CUFFT actually uses

  2. why CUFFT uses that much memory

  3. how this can be avoided/reduced

Tnx, Arno

It appears to me that the biggest 1D FFT you can plan is an 8M-point FFT; if you try to plan a 16M-point FFT it fails. Given that, I would expect a 4096x4096 2D FFT to fail as well, since it’s essentially the same size problem.

If you HAVE to have that size, I would try using multiple smaller FFTs to perform the same computation (it’s going to be a lot more complicated). There will be some intermediate stages to reorder the data, but if you consult your signal-processing texts it should be doable.

Ok, I’ve run some tests to get an idea of how much memory CUFFT actually uses. Test results are below. Image size is the number of floating-point pixels; Memory CUFFT is the total amount of memory used by CUFFT; Memory Buffer is the amount of memory used for the input and output buffers.

Image size  || Memory CUFFT    || Memory Buffer   


  256 x  256 ||  68616192 bytes ||    524288 bytes 

  512 x  512 ||  68616192 bytes ||   2097152 bytes 

  768 x  768 ||  96141312 bytes ||   4718592 bytes 

 1024 x 1024 ||  68616192 bytes ||   8388608 bytes 

 1280 x 1280 || 149618688 bytes ||  13107200 bytes 

 1536 x 1536 || 175570944 bytes ||  18874368 bytes 

 1792 x 1792 || 200998912 bytes ||  25690112 bytes 

 2048 x 2048 || 175570944 bytes ||  33554432 bytes 

 2304 x 2304 || 297467904 bytes ||  42467328 bytes 

 2560 x 2560 || 342556672 bytes ||  52428800 bytes 

 2816 x 2816 || 406257664 bytes ||  63438848 bytes 

 3072 x 3072 || 483852288 bytes ||  75497472 bytes 

 3328 x 3328 || 553058304 bytes ||  88604672 bytes 

 3584 x 3584 || 627769344 bytes || 102760448 bytes 

 3840 x 3840 || 710344704 bytes || 117964800 bytes 

 4096 x 4096 || 792133632 bytes || 134217728 bytes 

 4352 x 4352 || 737869824 bytes || 151519232 bytes 

 4608 x 4608 || 737869824 bytes || 169869312 bytes

These numbers follow a quadratic trend (because image size increases quadratically). I extrapolated the results to get an idea of what size of transforms might be available on Tesla and future architectures, but those results aren’t very promising:

a transform of 5120x5120 takes around 1.1 GB; 6144x6144 takes 1.6 GB (will that run out of memory even on Tesla platforms?)
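A rough rule of thumb can be fitted to the power-of-two rows of the table above: CUFFT's planning footprint comes out near 6x the size of one C2C buffer. The helper below is my own extrapolation from those measurements, and the factor is purely empirical, not anything the CUFFT documentation guarantees.

```c
#include <assert.h>
#include <stdint.h>

/* Empirical estimate of CUFFT 1.0 planning memory for an n x n C2C
   transform, n a power of two, fitted to the table in this thread. */
static uint64_t cufft_plan_bytes_estimate(uint64_t n) {
    uint64_t buffer = n * n * 8;   /* cufftComplex = 8 bytes */
    return 6 * buffer;             /* ~6x buffer, measured, not guaranteed */
}
```

For 4096x4096 this predicts about 805 MB against the measured 792133632 bytes, within a couple of percent, and for 5120x5120 it predicts the roughly 1.1-1.2 GB quoted above.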

Performing similar tests with FFTW also showed high memory usage, but far less than with CUFFT: a 6144x6144 FFT used ‘only’ 600 MB of memory (virtual memory included).

With respect to these problems: any news on this quote?



(P.s. SrJsignal: thanks for the suggestion; maybe I’ll try to split up the computation. If I succeed I’ll post the result.)

Good luck. We’ve kind of dropped this project for now; we do real-time processing, and the lack of concurrent I/O and processing basically makes CUDA a deal-breaker for us. We had our own in-house estimates of what doing the FFTs by hand would have gained; I’d be interested in what you come up with.

How did you get your test results? I made some simple tests by adding cuMemGetInfo( &theFree, &theTotal ) to report the free memory at various points. What puzzles me is that although I added the call at different program locations (before transforming the convolution kernel, before and after running the GPU FFT convolution, and other places that might consume memory), the printed result is always the same. It seems the memory the FFT uses is allocated when the FFT plan is created, isn’t it? Can you tell me the method you used to get your results? Looking forward to your reply.

The base sample is “convolutionFFT2D”; below is my result:

the Total Memory is :1610285056

Input data size           : 1018 x 1018

Convolution kernel size   : 7 x 7

Padded image size         : 1024 x 1024

Aligned padded image size : 1024 x 1024

Allocating memory...

Generating random input data...

Before Creating FFT plan!

the Free Memory is :1530855424  the Total Memory is :1610285056

Creating FFT plan for 1024 x 1024...

After Creating FFT plan!

the Free Memory is :1522466816  the Total Memory is :1610285056

Uploading to GPU and padding convolution kernel and input data...

...initializing padded kernel and data storage with zeroes...

...copying input data and convolution kernel from host to CUDA arrays

...binding CUDA arrays to texture references

...padding convolution kernel

...padding input data array

After padding input data, starting FFT transformation!

the Free Memory is :1522466816  the Total Memory is :1610285056

Transforming convolution kernel...

finish FFT kernel transformation,start running GPU FFT convolution

the Free Memory is :1522466816  the Total Memory is :1610285056

Running GPU FFT convolution...

finish input data transformation, start multiplying operation

the Free Memory is :1522466816  the Total Memory is :1610285056

finish multiplying operation, and start inverse FFT!

the Free Memory is :1522466816  the Total Memory is :1610285056

GPU time: 11.077584 msecs. //93.551444 MPix/s

finish inverse FFT!

the Free Memory is :1522466816  the Total Memory is :1610285056

Reading back GPU FFT results...

Checking GPU results...

...running reference CPU convolution

CPU time: 3587.733154 msecs. //0.288852 MPix/s

the performance diversity is :323.873244

...comparing the results

L2 norm: 2.053509E-007


Shutting down...

Press ENTER to exit...

Next are the free memory and total memory for different input data sizes:

input size                  free memory(byte)        total memory(byte)

5120×5120 :                  297664512                  1610285056

4096×4096 :                  750583808                  1610285056

3584×3584 :                  966721536                  1610285056

3072×3072 :                  1103101952                 1610285056

2048×2048 :                  1388183552                 1610285056

1024×1024 :                  1522335744                 1610285056

Nice to see that someone else can back me up on these results :thumbup:

The memory is indeed allocated while planning the FFT, and it is released when the plan is destroyed. I take it you cannot run a 6144x6144 FFT on your GPU (Tesla or Quadro?)

Yes, when the input data size is 6144×6144, the FFT cannot run, as shown below:

the Total Memory is :1610285056

Input data size           : 6144 x 6144

Convolution kernel size   : 7 x 7

Padded image size         : 6150 x 6150

Aligned padded image size : 6656 x 6656

Allocating memory...

Generating random input data...

Before Creating FFT plan!

the Free Memory is :545128448   the Total Memory is :1610285056

Creating FFT plan for 6656 x 6656...

cufft: ERROR:, line 228


CUFFT error in file '' in line 261.

Thanks for your reply.

But another question: why is so much memory already consumed before the FFT plan is created? In my mind, the initialization phase should not consume that much memory. Can you tell me the reason? Looking forward to your reply.

If you’re using “convolutionFFT2D”, you are allocating:

       CUDA_SAFE_CALL( cudaMallocArray(&a_Kernel, &float2tex, KERNEL_W, KERNEL_H) );

        CUDA_SAFE_CALL( cudaMallocArray(&a_Data,   &float2tex,   DATA_W,   DATA_H) );

        CUDA_SAFE_CALL( cudaMalloc((void **)&d_PaddedKernel, FFT_SIZE) );

        CUDA_SAFE_CALL( cudaMalloc((void **)&d_PaddedData,   FFT_SIZE) );

a_Kernel:  7 × 7 × 8 = 392 bytes

a_Data:    6144 × 6144 × 8 = 301989888 bytes

d_PKernel: 6656 × 6656 × 8 = 354418688 bytes

d_PData:   6656 × 6656 × 8 = 354418688 bytes

------------------------------------ +

Total allocated: 1010827656 bytes

Total available: 1610285056 bytes

Memory left: 599457400 bytes

theFree: 545128448 bytes

Difference: 54328952 bytes

The difference is roughly 50 MB of memory. This could be pitch or stride in the cudaMallocArray calls (just guessing), or other memory you may have allocated?
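The accounting above is easy to reproduce. The helper below is my own restatement of it (the buffer names and sizes come from the convolutionFFT2D run quoted earlier): two float2 CUDA arrays (kernel and data) plus two padded float2 device buffers.

```c
#include <assert.h>

/* Bytes allocated by convolutionFFT2D for its four buffers:
   a_Kernel, a_Data (CUDA arrays) and d_PaddedKernel, d_PaddedData. */
static long long conv_alloc_bytes(long long kw, long long kh,
                                  long long dw, long long dh,
                                  long long fw, long long fh) {
    const long long e = 8;   /* sizeof(float2) */
    return kw * kh * e       /* a_Kernel */
         + dw * dh * e       /* a_Data */
         + 2 * fw * fh * e;  /* d_PaddedKernel + d_PaddedData */
}
```

Subtracting this from the card total and comparing with the reported free memory gives the same ~50 MB unexplained gap as above.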

All of my numbers are for a R2C 1D FFT.

I’m guessing there’s something funny going on in the memory “watching”.

Also, I’ve noticed that non-power-of-2 sized FFTs are crazy slow, which is to be expected, but I would guess the inefficiencies of a non-power-of-2 size also require more memory. I’m not sure of that, but wouldn’t be surprised.

Yes, I think I’ve got my answer. Thanks for your reply.