So I try (try being the key word here) to get smart and calculate how much memory I will need to perform a large FFT plan that will take multiple memory transfers due to GPU memory limitations. I do this successfully, since I know how much memory the input and output data will require. Then I create a plan and get a CUFFT_ALLOC_FAILED (CUFFT failed to allocate GPU memory) error.
I guess this is to be expected since I am trying to maximize throughput. Then I try to figure out how much memory a plan requires, and this is seemingly a black hole. I could, I guess, plot N vs. batch number on a 2D plot: create plans while varying N and batch number, call cudaMemGetInfo( &gpu_free_mem_bytes, &gpu_total_mem_bytes ), look at the difference gpu_total_mem_bytes - gpu_free_mem_bytes while creating and destroying plans, then rinse, wash, repeat for each of R2C, C2C… and various GPUs, etc. Or maybe NVIDIA could provide a better API allowing the programmer to know what resources will be needed based on N and batch sizes. This type of programming reminds me of making sausage… it’s super awesome. Having used CUDA since 2.0, I am not going to hold my breath… or expect much.
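For what it’s worth, that brute-force measurement is straightforward to script. A minimal sketch, assuming a C2C 1D plan and a power-of-two sweep (measure_plan_bytes is my name, not a library call); note the delta is only as trustworthy as the allocator, since cached pools can make it read as zero:

#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* Measure how much GPU memory a single plan grabs by sampling free
   memory before and after plan creation. */
size_t measure_plan_bytes(int nx, int batch)
{
    size_t free_before, free_after, total;
    cudaMemGetInfo(&free_before, &total);

    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_C2C, batch) != CUFFT_SUCCESS)
        return 0;                          /* plan creation itself failed */

    cudaMemGetInfo(&free_after, &total);
    cufftDestroy(plan);
    return free_before - free_after;       /* bytes taken by the plan */
}

int main(void)
{
    for (int nx = 1024; nx <= 65536; nx *= 2)
        for (int batch = 1; batch <= 2048; batch *= 4)
            printf("nx=%d batch=%d plan=%zu bytes\n",
                   nx, batch, measure_plan_bytes(nx, batch));
    return 0;
}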
// Pseudo code
Set batch to total_num_ffts needed.
last_successful = 0
// Now here's where things get fabulous (sausage making in progress... turn the crank... mmm, sausage)
while( forever_and_a_few_nanoseconds )
    create the 2 plans, 1 forward and 1 inverse. This allocates memory for the plans.
    allocate memory for the forward and inverse transform data (2 buffers total - out-of-place).
    allocate memory for the FFT filter.
    check whether any of the above failed
    if failed, batch is too big
        deallocate all requested GPU memory
        prev = batch
        // go halfway down toward the last successful batch
        batch = batch - (batch - last_successful)/2;
    end
    if successful
        // we may have a good batch num, but because of the bidirectional divide-by-2 search
        // it could still be too small
        last_successful = batch
        temp = batch;
        // go halfway up between the current batch and the previous (failed) one
        batch = batch + (prev - batch) / 2;
    end
    // See if we found a good batch for our plan, i.e. the search has
    // converged and batch stopped moving
    if( prev == batch )
        sausage_making = complete
        // the below needs to be rechecked, as the state of the GPU could change next time around
        serialize magic batch num to a file for later use for this gpu
        break // out dancing
    end
end
use batch and total_num_ffts to chunk up the data using a planner for multiple GPU transfers (a runnable version of this search is sketched below)
This is the current … err ummm … approach
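A runnable C version of that search, as one possible concretization — the helper names (try_batch, find_max_batch) and the in/out/filter buffer sizes are my assumptions about the workload, not anything from the CUFFT API:

#include <cuda_runtime.h>
#include <cufft.h>

/* Try to create both plans plus the data/filter buffers for a given
   batch; report success, then free everything again. */
static int try_batch(int nx, int batch)
{
    cufftHandle fwd = 0, inv = 0;
    cufftComplex *in = 0, *out = 0, *filt = 0;
    size_t n = (size_t)nx * (size_t)batch;

    int ok = cufftPlan1d(&fwd, nx, CUFFT_C2C, batch) == CUFFT_SUCCESS
          && cufftPlan1d(&inv, nx, CUFFT_C2C, batch) == CUFFT_SUCCESS
          && cudaMalloc((void**)&in,   sizeof(cufftComplex) * n)  == cudaSuccess
          && cudaMalloc((void**)&out,  sizeof(cufftComplex) * n)  == cudaSuccess
          && cudaMalloc((void**)&filt, sizeof(cufftComplex) * nx) == cudaSuccess;

    if (fwd) cufftDestroy(fwd);
    if (inv) cufftDestroy(inv);
    cudaFree(in); cudaFree(out); cudaFree(filt);   /* cudaFree(NULL) is a no-op */
    return ok;
}

/* Binary search for the largest batch that fits, starting from the
   total number of FFTs needed. Returns 0 if nothing fits. */
int find_max_batch(int nx, int total_num_ffts)
{
    int lo = 0, hi = total_num_ffts;        /* invariant: lo fits, hi may not */
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;   /* bias upward so the loop terminates */
        if (try_batch(nx, mid))
            lo = mid;                       /* mid fits: search higher */
        else
            hi = mid - 1;                   /* mid fails: search lower */
    }
    return lo;
}

This converges in O(log total_num_ffts) probes instead of the halving-and-backtracking walk above, and sidesteps the corner case where prev is still undefined on the first successful try.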
This could have been avoided (and maybe there is another way) if cufftPlan1d/2d/3d/Many were of the form:
cufftReal*    data_buff;
cufftComplex* fft_buff;
size_t num_chunks;
// Why is batch an int and not a size_t??? The same could be asked of all functions in this lib.
int batch = num_batches_requested;
size_t num_overflow_mod_remainder;
// this would provide guaranteed memory allocation at the time of the request
cufftPlan1d(
    cufftHandle* plan, int nx, cufftType type,
    int& batch,
    size_t& num_chunks,
    size_t& num_overflow_mod_remainder,
    void* data_buff, void* fft_buff,
    ALLOC_MEM
);
So the requested batch is passed in, but batch is updated with what is actually possible, along with num_chunks (the number of GPU data transfers to perform) and num_overflow_mod_remainder, which is either zero or the number of remaining batches that must be performed in a last, smaller chunk. So the total number of chunks is num_chunks + 1 if there is an uneven remainder.
or if a planner function could be provided:
cufftPlanPossible(
    cufftHandle* plan, int nx, cufftType type,
    size_t& batch, size_t& num_chunks,
    size_t& num_overflow_mod_remainder
)
{
    // calculate what is possible and return batch, num_chunks, and num_overflow_mod_remainder to the user
    num_chunks = total_num_ffts / batch;                  // floored integer division
    num_overflow_mod_remainder = total_num_ffts % batch;  // number of overflow FFTs
}
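A host-side sketch of roughly that planner’s arithmetic (plan_possible is a hypothetical name of mine; for what it’s worth, much later CUFFT releases did add an estimation entry point along these lines, cufftEstimate1d(nx, type, batch, &workSize), though nothing like it existed in the CUDA 3.2 era):

#include <stddef.h>

/* Hypothetical planner along the lines proposed above: given the total
   number of FFTs and the largest batch that fits (found by probing),
   report how the work splits into chunks. */
void plan_possible(size_t total_num_ffts, size_t batch,
                   size_t *num_chunks, size_t *num_overflow_mod_remainder)
{
    *num_chunks = total_num_ffts / batch;                  /* full chunks, floored */
    *num_overflow_mod_remainder = total_num_ffts % batch;  /* FFTs left for a final, smaller chunk */
}

For example, total_num_ffts = 10000 with batch = 2503 gives num_chunks = 3 and a remainder of 2491, i.e. 3 + 1 = 4 transfers in total.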
Performance has its price… paid in the denomination of aspirin tablets.
Also, from testing, the number of batches per chunk turns out to be 2059 on a Quadro 1700M, which is equal to maxThreadsPerBlock for this processor. So this may not be a memory limitation, but rather a plan limitation in the CUDA FFT library based on the GPU’s threads per block? So now it appears there are two limitations: 1) memory, and 2) the max batches that can be processed, based on maxThreadsPerBlock and the FFT library implementation. Of course there is little to no documentation on what batch can be… i.e. NO, NONE, NADA, ZIP, ZIPPO, ZERO documentation, much less anything thorough.
The above was found to be incorrect due to a bug in how the data structure (cudaDeviceProp) that held this information was being passed. At the time it did not seem right to me, as the thread dimensions being reported did not look correct.
The actual values are 512 threads per block, while the batches per chunk are:
    num_batches          2503  (int)
    maxThreadsPerBlock   512   (int)
which means the plan is not limited by the threads per block, which is back to making sense. Therefore the limiting factor is back to being memory, which, as stated earlier, is not easy to determine ahead of time.
Thought I would correct this for those who are interested.
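For anyone chasing the same bug, a minimal sketch of querying that limit correctly; the key point is that cudaGetDeviceProperties fills in a cudaDeviceProp struct passed by address:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;                  /* filled in by the runtime */
    cudaGetDeviceProperties(&prop, 0);    /* pass by address, for device 0 */
    printf("%s: maxThreadsPerBlock = %d\n", prop.name, prop.maxThreadsPerBlock);
    return 0;
}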
I am reviving this thread because I am facing troubles with cufft which look related to memory management, but I couldn’t find much info or help.
Could someone give us some pointers about memory allocation in cufft?
Here is what I am facing:
I use cufftPlan1d to allocate several plans.
I check the result of the calls, which is always CUFFT_SUCCESS.
But when executing the FFT, it sometimes crashes with a CUFFT_EXEC_FAILED error.
I tried to work around this by checking the available memory before and after the cufftPlan1d call and considering it a failure if no memory was allocated. But in many cases the free memory does not decrease (so no memory seems to have been allocated…) yet the execution is still fine.
2nd problem: in some cases the execution does not crash but no filtering occurs…
I am using CUDA 3.2 under WinXP 64 and Linux Ubuntu 64. I should mention that it has never crashed under Windows until now…
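One thing worth ruling out when CUFFT_EXEC_FAILED shows up only sometimes: an earlier asynchronous error surfacing late. A minimal sketch of checking both the CUFFT result and the runtime error state right after the exec call (sizes are arbitrary; on CUDA 3.2 the synchronize call was cudaThreadSynchronize rather than cudaDeviceSynchronize):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int nx = 2048, batch = 16;
    cufftComplex *d_data;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * nx * batch);

    cufftHandle plan;
    cufftResult r = cufftPlan1d(&plan, nx, CUFFT_C2C, batch);
    if (r != CUFFT_SUCCESS) { printf("plan failed: %d\n", r); return 1; }

    /* Execute in place, then force completion so an asynchronous failure
       surfaces here instead of at some later, unrelated call. */
    r = cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaError_t e = cudaDeviceSynchronize();
    if (r != CUFFT_SUCCESS || e != cudaSuccess)
        printf("exec failed: cufft=%d cuda=%s\n", r, cudaGetErrorString(e));

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}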
What size FFT are you trying to do? There is an upper limit, although I’m not sure what it currently is.
As far as plans go, I used a brute-force method to figure out that C2C 1D FFT plans require about 8*FFTSIZE bytes of memory for FFTSIZE > 64k, and 1 MByte for FFTSIZE <= 64k, regardless of batch size. I didn’t test for FFTSIZEs > 1024k, though. If anyone has better information regarding the memory use of FFT plans, I would be very interested.
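Encoding that observation as a rough estimator — purely empirical, so treat the constants as a guess that may not hold across CUFFT versions or for sizes above 1024k:

#include <stddef.h>

/* Rough plan-memory estimate for 1D C2C plans, from the brute-force
   measurements above: ~1 MB at or below 64k points, ~8 bytes/point above. */
size_t estimate_c2c_plan_bytes(size_t fftsize)
{
    return (fftsize <= 65536) ? (size_t)1 << 20 : 8 * fftsize;
}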
Thank you Charley, it feels good not to be alone.
The size of the signal is a power of 2, typically 1024 or 2048, sometimes 4096. The batch size varies from a few units to a few hundred. Nothing extreme…