Using an 8800 GTS (320 MB) with the 0.9 SDK. Yes, this is the only allocation on the device, and nothing else is using the card (except my plain KDE Linux desktop).

void
runTest(
    int argc,
    char** argv
)
{
    //.........................................
    CUT_CHECK_DEVICE();
    //.........................................
    rows = 4096;
    cols = 4096;
    unsigned int imgSize = cols * rows;
    unsigned int memSize = ( cols * rows ) * sizeof( Complex );
    // Allocate host memory for the signal (my image)
    Complex *pImage = ( Complex * ) malloc( memSize );
    // Initialize the image to something
    for ( unsigned int i = 0; i < imgSize; ++i ) {
        pImage[ i ].x = rand() / ( float )RAND_MAX;
        pImage[ i ].y = 0;
    }
    // Allocate device memory for signal
    Complex *d_pImage;
    //..........................................
    CUDA_SAFE_CALL( cudaMalloc( ( void ** ) &d_pImage, memSize ) );
    // Copy host memory to device
    CUDA_SAFE_CALL( cudaMemcpy( d_pImage, pImage, memSize, cudaMemcpyHostToDevice ) );
    //.........................................
    // CUFFT plan
    cufftHandle plan;
    // %%%%%%% THIS FAILS FOR 4096x4096 %%%%%%%%
    CUFFT_SAFE_CALL( cufftPlan2d( &plan, rows, cols, CUFFT_C2C ) );
    // %%%%%%%%%%%%%%%%%%%%%
    ....

I am running into the same sort of problem. I am using an 8800 GTX (768 MB) with CUDA 1.0 and CUFFT 1.0 on WinXP 32-bit. I allocate 6 buffers in global memory for a 3584x3584 image and plan an out-of-place 2D FFT and inverse FFT of real-to-complex data.

If I don’t use CUFFT, I can allocate 350 MB more global memory than when I do use CUFFT. That is 7 more buffers holding the complete image!

Since this memory usage has severe implications for my application, I would like some insight into:

It appears to me that the biggest 1D FFT you can plan is an 8M-point FFT; if you try to plan a 16M-point FFT, it fails. Given that, I would expect a 4k x 4k 2D FFT to fail as well, since it has the same number of points.
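The comparison above comes down to point counts. A minimal sketch of that arithmetic follows; the helper name `fft_points` is made up for illustration, and the 8M-point ceiling is only what the poster observed, not a documented CUFFT limit.

```c
#include <stddef.h>

/* Number of points in a rows x cols transform. */
unsigned long long fft_points(unsigned long long rows, unsigned long long cols)
{
    return rows * cols;
}

/* Largest 1D plan the poster reports succeeding: 8M points, with M = 2^20.
   This is an observation from the thread, not a documented limit. */
static const unsigned long long OBSERVED_MAX_POINTS = 8ULL * 1024 * 1024;

/* Nonzero when the transform has more points than the observed ceiling. */
int exceeds_observed_limit(unsigned long long rows, unsigned long long cols)
{
    return fft_points(rows, cols) > OBSERVED_MAX_POINTS;
}
```

Calling `fft_points(4096, 4096)` gives 16777216, which is 16M points, so `exceeds_observed_limit(4096, 4096)` is nonzero while a 2048 x 2048 transform (4M points) stays under the ceiling.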

If you HAVE to have that size, I would try using multiple smaller FFTs to perform the same thing (it’s going to be a lot more complicated). There will be some intermediate stages to reorder the data, but if you consult your signal processing texts it should be doable.
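One textbook way to build a 2D transform out of smaller ones is the row-column decomposition: run a 1D transform over every row, then over every column. The sketch below shows this on a 4 x 4 image with a naive 4-point DFT in integer arithmetic, so the twiddle factors are exact. All names are illustrative, this is not CUFFT code, and whether the approach actually lowers CUFFT's memory footprint is not something the thread establishes.

```c
#include <stddef.h>

#define N 4  /* transform length per dimension; 4 keeps the twiddles exact */

typedef struct { long re, im; } cpx;

/* exp(-2*pi*i*k/4) for k = 0..3: 1, -i, -1, i */
static const cpx TWIDDLE4[4] = { {1, 0}, {0, -1}, {-1, 0}, {0, 1} };

static cpx cmul(cpx a, cpx b)
{
    cpx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* In-place naive 4-point DFT over data[0], data[stride], data[2*stride], ... */
static void dft4(cpx *data, size_t stride)
{
    cpx out[N];
    for (size_t k = 0; k < N; ++k) {
        cpx acc = { 0, 0 };
        for (size_t n = 0; n < N; ++n) {
            cpx t = cmul(data[n * stride], TWIDDLE4[(k * n) % N]);
            acc.re += t.re;
            acc.im += t.im;
        }
        out[k] = acc;
    }
    for (size_t k = 0; k < N; ++k)
        data[k * stride] = out[k];
}

/* 2D DFT of a row-major N x N image using 1D transforms only. */
void dft2d_row_column(cpx img[N * N])
{
    for (size_t r = 0; r < N; ++r)
        dft4(&img[r * N], 1);   /* transform each row */
    for (size_t c = 0; c < N; ++c)
        dft4(&img[c], N);       /* transform each column */
}

/* Reference: direct evaluation of the 2D DFT definition. */
void dft2d_direct(const cpx in[N * N], cpx out[N * N])
{
    for (size_t u = 0; u < N; ++u)
        for (size_t v = 0; v < N; ++v) {
            cpx acc = { 0, 0 };
            for (size_t r = 0; r < N; ++r)
                for (size_t c = 0; c < N; ++c) {
                    cpx t = cmul(in[r * N + c],
                                 TWIDDLE4[(u * r + v * c) % N]);
                    acc.re += t.re;
                    acc.im += t.im;
                }
            out[u * N + v] = acc;
        }
}
```

Both functions produce the same output for any input. On a GPU the row and column passes would each be a batch of 1D transforms, and the strided column pass is where the data reordering mentioned above comes in.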

OK, I’ve run some tests to get an idea of how much memory CUFFT actually uses. The results are below. Image size is the number of floating-point pixels; Memory CUFFT is the total amount of memory used by CUFFT; Memory Buffer is the amount of memory used for the input and output buffers.

Image size  || Memory CUFFT (bytes) || Memory Buffer (bytes)
-------------------------------------------------------------
256 x 256   ||             68616192 ||                524288
512 x 512   ||             68616192 ||               2097152
768 x 768   ||             96141312 ||               4718592
1024 x 1024 ||             68616192 ||               8388608
1280 x 1280 ||            149618688 ||              13107200
1536 x 1536 ||            175570944 ||              18874368
1792 x 1792 ||            200998912 ||              25690112
2048 x 2048 ||            175570944 ||              33554432
2304 x 2304 ||            297467904 ||              42467328
2560 x 2560 ||            342556672 ||              52428800
2816 x 2816 ||            406257664 ||              63438848
3072 x 3072 ||            483852288 ||              75497472
3328 x 3328 ||            553058304 ||              88604672
3584 x 3584 ||            627769344 ||             102760448
3840 x 3840 ||            710344704 ||             117964800
4096 x 4096 ||            792133632 ||             134217728
4352 x 4352 ||            737869824 ||             151519232
4608 x 4608 ||            737869824 ||             169869312
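One way to read the table is as an overhead factor: CUFFT memory divided by the memory of the input and output buffers. The sketch below does just that; `overhead_ratio` is a name made up here, and the inputs are copied from the rows above.

```c
/* CUFFT memory as a multiple of the input + output buffer memory. */
double overhead_ratio(unsigned long long cufft_bytes,
                      unsigned long long buffer_bytes)
{
    return (double)cufft_bytes / (double)buffer_bytes;
}
```

For the 256 x 256 row, `overhead_ratio(68616192ULL, 524288ULL)` is 130.875, while for the 4096 x 4096 row, `overhead_ratio(792133632ULL, 134217728ULL)` is about 5.9. The small sizes look dominated by a fixed cost, and the large ones settle at roughly six times the buffer memory.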

These numbers follow a roughly quadratic trend in the image side length (because the pixel count increases quadratically). I extrapolated the results to get an idea of what transform sizes might be available on Tesla and future architectures, but the results aren’t very promising:

A 5120x5120 transform would take around 1.1 GB and a 6144x6144 transform around 1.6 GB (will that run out of memory on Tesla platforms?)
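The poster's fitting method is not given. A simple stand-in is to scale the measured 4096 x 4096 figure by the ratio of pixel counts, which lands close to the quoted estimates if GB is read as 2^30 bytes. The function name is hypothetical.

```c
/* Scale a measured footprint for a square image by the ratio of pixel
   counts. Assumes memory grows in proportion to side * side. */
unsigned long long scale_by_pixels(unsigned long long ref_bytes,
                                   unsigned long long ref_side,
                                   unsigned long long side)
{
    return ref_bytes * side * side / (ref_side * ref_side);
}
```

With the 4096 x 4096 measurement of 792133632 bytes as the reference, `scale_by_pixels(792133632ULL, 4096, 5120)` gives 1237708800 bytes (about 1.15 GiB) and `scale_by_pixels(792133632ULL, 4096, 6144)` gives 1782300672 bytes (about 1.66 GiB).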

Similar tests with FFTW also showed high memory usage, but far less than with CUFFT. A 6144x6144 FFT used ‘only’ 600 MB of memory (virtual memory included).

With respect to these problems: any news on this quote?

Thnx,

Arno

(PS. SrJsignal: Thanks for the suggestion. Maybe I’ll try to split up the computation; if I succeed, I’ll post the result.)

Good luck. We’ve kind of dropped this project for now: we do real-time processing, and the lack of concurrent IO and processing is basically a deal-breaker for using CUDA. We had our own in-house estimates of what doing the FFTs by hand would have gained, so I’d be interested in what you come up with.

How did you get your test results, please? I ran a simple test by adding calls to cuMemGetInfo( &theFree, &theTotal ) to print the free memory at various points. What puzzles me is that, although I added the call at different places in the program (before “transforming convolution kernel”, before and after “running GPU FFT convolution”, and at other places where memory might be consumed), the printed result is always the same. It seems that the memory the FFT uses is allocated when the FFT plan is created, isn’t it? Could you tell me the method you used to get your results? I look forward to your reply.

My test is based on the “convolutionFFT2D” sample; my result is below:

the Total Memory is :1610285056
Input data size : 1018 x 1018
Convolution kernel size : 7 x 7
Padded image size : 1024 x 1024
Aligned padded image size : 1024 x 1024
Allocating memory...
Generating random input data...
Before Creating FFT plan!
the Free Memory is :1530855424 the Total Memory is :1610285056
Creating FFT plan for 1024 x 1024...
After Creating FFT plan!
the Free Memory is :1522466816 the Total Memory is :1610285056
Uploading to GPU and padding convolution kernel and input data...
...initializing padded kernel and data storage with zeroes...
...copying input data and convolution kernel from host to CUDA arrays
...binding CUDA arrays to texture references
...padding convolution kernel
...padding input data array
After padding input data, starting FFT transformation！
the Free Memory is :1522466816 the Total Memory is :1610285056
Transforming convolution kernel...
finish FFT kernel transformation,start running GPU FFT convolution
the Free Memory is :1522466816 the Total Memory is :1610285056
Running GPU FFT convolution...
finish input data transformation, start multiplying operation
the Free Memory is :1522466816 the Total Memory is :1610285056
finish multiplying operation, and start inverse FFT！
the Free Memory is :1522466816 the Total Memory is :1610285056
GPU time: 11.077584 msecs. //93.551444 MPix/s
finish inverse FFT！
the Free Memory is :1522466816 the Total Memory is :1610285056
Reading back GPU FFT results...
Checking GPU results...
...running reference CPU convolution
CPU time: 3587.733154 msecs. //0.288852 MPix/s
the performance diversity is :323.873244
...comparing the results
L2 norm: 2.053509E-007
TEST PASSED
Shutting down...
Press ENTER to exit...

Next are the free memory and total memory for different input data sizes:

Nice to see that someone else can back me up on these results.

The memory is indeed allocated during planning of the FFT, and it is released when the plan is destroyed. I take it that you cannot run a 6144x6144 FFT on your GPU (Tesla or Quadro?)
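The 1024 x 1024 log earlier in the thread lets this be checked with plain arithmetic: the free memory drops once, at plan creation, and stays flat afterwards. A small sketch follows, with made-up helper names and the assumption that one complex element is two 4-byte floats.

```c
/* Bytes that disappeared between two free-memory readings. */
unsigned long long free_delta(unsigned long long free_before,
                              unsigned long long free_after)
{
    return free_before - free_after;
}

/* Size of a rows x cols buffer of single-precision complex values.
   Assumption: 8 bytes per element (two 4-byte floats). */
unsigned long long complex_float_bytes(unsigned long long rows,
                                       unsigned long long cols)
{
    return rows * cols * 8ULL;
}
```

Using the readings before and after "Creating FFT plan for 1024 x 1024", `free_delta(1530855424ULL, 1522466816ULL)` is 8388608 bytes, which equals `complex_float_bytes(1024, 1024)`. What CUFFT stores in that space is not visible from the log.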

Yes, when the input data size is 6144×6144, the FFT cannot run, as shown below:

the Total Memory is :1610285056
Input data size : 6144 x 6144
Convolution kernel size : 7 x 7
Padded image size : 6150 x 6150
Aligned padded image size : 6656 x 6656
Allocating memory...
Generating random input data...
Before Creating FFT plan!
the Free Memory is :545128448 the Total Memory is :1610285056
Creating FFT plan for 6656 x 6656...
cufft: ERROR: config.cu, line 228
cufft: ERROR: CUFFT_ALLOC_FAILED
CUFFT error in file 'convolutionFFT2D.cu' in line 261.

Thanks for your reply.

Another question: why has so much memory already been consumed before the FFT plan is created? To my mind, the initialization phase should not consume that much memory. Can you tell me the reason? I look forward to your reply.

The difference is roughly 50 MB of memory. This could be due to the pitch or stride of the cudaMallocArray allocations (just guessing), or to other memory you may have allocated.
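For reference, the memory already in use before planning can be read off the two logs as total minus free. The sketch below only does that subtraction; the helper names are made up, and the logs do not say which allocations account for the result.

```c
/* Memory in use, from a (total, free) pair as reported by cuMemGetInfo. */
unsigned long long used_bytes(unsigned long long total_bytes,
                              unsigned long long free_bytes)
{
    return total_bytes - free_bytes;
}

/* Whole mebibytes (2^20 bytes), rounded down. */
unsigned long long to_mib(unsigned long long bytes)
{
    return bytes / (1024ULL * 1024ULL);
}
```

For the 1018 x 1018 run, `used_bytes(1610285056ULL, 1530855424ULL)` is 79429632 bytes (75 MiB rounded down). For the 6144 x 6144 run, `used_bytes(1610285056ULL, 545128448ULL)` is 1065156608 bytes (1015 MiB rounded down).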

I’m guessing that there’s something funny going on in the memory “watching”.

Also, I’ve noticed that the speed is crazy slow with non-power-of-2 FFT sizes, which is to be expected, but I would guess that the inefficiencies of a non-power-of-2 transform also require more memory. I’m not sure about that, but I wouldn’t be surprised.
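The measurements posted earlier in the thread give this guess partial support: the 1024 x 1024 and 2048 x 2048 rows report less CUFFT memory than the smaller 768 x 768 and 1792 x 1792 rows, although 4096 x 4096 does not follow the pattern. A quick way to flag which sizes are powers of two is sketched below with a hypothetical helper name.

```c
/* Nonzero when n is a power of two (1, 2, 4, 8, ...). */
int is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```

Here `is_power_of_two(1024)` and `is_power_of_two(2048)` are nonzero, while `is_power_of_two(768)` and `is_power_of_two(1792)` are zero.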