Does cufftPlan3d allocate additional memory? Why?


I’m using the CUFFT library to perform a 3D Fourier transform. My card is a GeForce 8500 GT, running CUDA 2.1 under GNU/Linux. I use an in-place transform to save memory, but it seems to me that the call to ‘cufftPlan3d’ allocates additional memory anyway. Here’s a sample code:

    // Allocate device memory
    Complex *d_signal, *d_filter_kernel;
    cutilSafeCall(cudaMalloc((void**)&d_signal, mem_size));
    cutilSafeCall(cudaMalloc((void**)&d_filter_kernel, mem_size));

    // Copy host memory to device
    cutilSafeCall(cudaMemcpy(d_signal, h_padded_signal, mem_size, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(d_filter_kernel, h_padded_filter_kernel, mem_size, cudaMemcpyHostToDevice));

    cuMemGetInfo(&theFree, &theTotal);
    printf("CARD memory after allocating memory:\n");
    printf("Free:  %u\nTotal: %u\nAllocated: %u\n", theFree, theTotal, theTotal - theFree);

    // CUFFT plan
    cufftHandle plan;
    cufftSafeCall(cufftPlan3d(&plan, size_x, size_y, size_z, CUFFT_C2C));

    cuMemGetInfo(&theFree, &theTotal);
    printf("CARD memory after running cufftPlan3d:\n");
    printf("Free:  %u\nTotal: %u\nAllocated: %u\n", theFree, theTotal, theTotal - theFree);

    // Transform signal and kernel (in place)
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD));
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_filter_kernel, (cufftComplex *)d_filter_kernel, CUFFT_FORWARD));

    // Multiply the coefficients together and normalize the result
    cutilSafeCall(cudaThreadSynchronize());
    ComplexPointwiseMulAndScale<<<GRID, THREADS>>>(d_signal, d_filter_kernel, size_x * size_y * size_z, norm_factor);

    // Transform signal back
    cutilSafeCall(cudaThreadSynchronize());
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE));


The output is the following:


    CARD memory after allocating memory:
    free:      163020544
    total:     267714560
    allocated: 104694016

    CARD memory after running cufftPlan3d:
    free:       93740800
    total:     267714560
    allocated: 173973760


The situation seems to be the same with 1D (and maybe even 2D?) transforms. In practice, the data can occupy only about half of the GPU memory; otherwise the program crashes for lack of memory. Is there any way to perform the FFT without allocating this additional memory?

I have made some further tests and found that this memory ‘effect’ appears only for certain data sizes. For example, it does NOT appear for a 3D FFT of a 320 x 320 x 130 array, but it DOES appear for a 1088 x 1088 x 4 array, and for 4 x 1088 x 1088 as well. (Don’t mind these strange numbers; the program computes a convolution, so the input signal has to be padded according to the filter kernel size.)

My guess is that if the input signal is large in two dimensions, the 3D FFT is split into several 2D FFTs, and additional memory has to be allocated for that. I would appreciate it if someone could tell me whether this is right, and whether the ‘effect’ can be avoided.