Does cufftPlan3d allocate additional memory? Why?


I’m using the CUFFT library to perform a 3D Fourier transform. My card is a GeForce 8500 GT, running CUDA 2.1 under GNU/Linux. I use an in-place transform to save memory, but it seems to me that the call to ‘cufftPlan3d’ allocates additional memory anyway. Here’s a sample code:

    // Allocate device memory
    Complex *d_signal, *d_filter_kernel;
    cutilSafeCall(cudaMalloc((void**)&d_signal, mem_size));
    cutilSafeCall(cudaMalloc((void**)&d_filter_kernel, mem_size));

    // Copy host memory to device
    cutilSafeCall(cudaMemcpy(d_signal, h_padded_signal, mem_size, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(d_filter_kernel, h_padded_filter_kernel, mem_size, cudaMemcpyHostToDevice));

    cuMemGetInfo(&theFree, &theTotal);
    printf("CARD memory after allocating memory:\n");
    printf("Free:  %u\nTotal: %u\nAllocated: %u\n", theFree, theTotal, theTotal - theFree);

    // CUFFT plan
    cufftHandle plan;
    cufftSafeCall(cufftPlan3d(&plan, size_x, size_y, size_z, CUFFT_C2C));

    cuMemGetInfo(&theFree, &theTotal);
    printf("CARD memory after running cufftPlan3d:\n");
    printf("Free:  %u\nTotal: %u\nAllocated: %u\n", theFree, theTotal, theTotal - theFree);

    // Transform signal and kernel (in place)
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD));
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_filter_kernel, (cufftComplex *)d_filter_kernel, CUFFT_FORWARD));

    // Multiply the coefficients together and normalize the result
    cutilSafeCall(cudaThreadSynchronize());
    ComplexPointwiseMulAndScale<<<GRID, THREADS>>>(d_signal, d_filter_kernel, size_x * size_y * size_z, norm_factor);

    // Transform signal back
    cutilSafeCall(cudaThreadSynchronize());
    cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE));


The output is the following:


    CARD memory after allocating memory:
    free:      163020544
    total:     267714560
    allocated: 104694016

    CARD memory after running cufftPlan3d:
    free:       93740800
    total:     267714560
    allocated: 173973760


The situation seems to be the same with 1D (and maybe even 2D?) transforms. In practice, the data can occupy only about half of the GPU memory; otherwise the program crashes for lack of memory. Is there any way to perform the FFT without allocating this additional memory?

I have made some further tests and found that this memory ‘effect’ appears only for certain data sizes. For example, it does NOT appear for a 3D FFT of a 320 x 320 x 130 array, but it DOES appear for a 1088 x 1088 x 4 array, and for 4 x 1088 x 1088 as well. (Don’t mind these strange numbers; the program computes a convolution, so the input signal has to be padded according to the filter kernel size.)

My guess is that if the input signal is large in two dimensions, the 3D FFT is split into several 2D FFTs, and additional memory has to be allocated for that. I would appreciate it if someone could tell me whether this is right, and whether the ‘effect’ can be avoided.