Memory needed for CUDA 1D FFT plan creation - or how to make sausage with CUDA hacks

So I try (try being the key word here) to get smart and calculate how much memory I will need to perform a large FFT plan, one that will take multiple memory transfers due to GPU memory limitations. I do this successfully, since I know how much memory the input and output data will require. Then I create a plan and I get a CUFFT_ALLOC_FAILED (CUFFT failed to allocate GPU memory) error.

I guess this is to be expected since I am trying to maximize throughput. Then I try to figure out how much memory a plan requires, and this is seemingly a black hole. I could, I guess, plot N vs. batch number on a 2D plot after creating plans with varying N and batch number, calling cudaMemGetInfo( &gpu_free_mem_bytes, &gpu_total_mem_bytes ) and looking at the difference gpu_total_mem_bytes - gpu_free_mem_bytes while creating and destroying plans, then rinse, wash, repeat for each of R2C, C2C… and various GPUs, etc. Or maybe NVIDIA could provide a better API allowing the programmer to know what resources will be needed based on N and batch size. This type of programming reminds me of making sausage… it’s super awesome. Having used CUDA since 2.0, I am not going to hold my breath… or expect much.
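A minimal sketch of that probing approach, assuming a C2C 1D plan; the loop bounds and output format are illustrative, and the before/after difference is only a rough guide since the driver may allocate in larger granules:

#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
	// Probe plan memory usage by diffing free device memory around plan creation.
	for ( int n = 1024; n <= (1 << 20); n <<= 1 )
	{
		size_t free_before, free_after, total;
		cudaMemGetInfo( &free_before, &total );

		cufftHandle plan;
		if ( cufftPlan1d( &plan, n, CUFFT_C2C, 1 ) != CUFFT_SUCCESS )
		{
			printf( "n = %d: plan creation failed\n", n );
			continue;
		}
		cudaMemGetInfo( &free_after, &total );
		printf( "n = %7d: plan uses ~%zu bytes\n", n, free_before - free_after );
		cufftDestroy( plan );
	}
	return 0;
}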

The current approach to this problem is to:

// Pseudo code

set batch to total_num_ffts needed

last_successful = 0
prev = batch  // so that a first-try success terminates the search immediately

// Now here's where things get fabulous (sausage making in progress... turn the crank... mmm, sausage)

while( forever_and_a_few_nanoseconds )

	create the 2 plans, 1 forward and 1 inverse (this allocates memory for the plans)

	allocate memory for the forward and inverse transform data (2 buffers total - out-of-place)

	allocate memory for the FFT filter

	check if any of the above failed

	if failed, batch is too big

		deallocate all requested GPU memory

		prev = batch

		// go halfway down to the last successful batch
		batch = batch - (batch - last_successful)/2

	end

	if successful

		// we may have a good batch num, but because of the bidirectional divide-by-2 search
		// it could be too small

		// free this attempt too; the plans get re-created with the final batch once the search ends
		deallocate all requested GPU memory

		last_successful = batch

		// go halfway up toward the previous (failed) batch
		batch = batch + (prev - batch)/2

	end

	// See if the search interval has collapsed onto a good batch for our plan
	// (checking prev == batch alone can loop forever once the two differ by 1)
	if( prev - last_successful <= 1 )

		batch = last_successful
		sausage_making = complete

		// the value below needs to be re-checked on reuse, as the state of the GPU could change next time around
		serialize the magic batch num to a file for later use on this GPU

		break // out dancing

	end

end

use batch and total_num_batches to chunk up the data using a planner for multiple GPU transfers

This is the current … err, umm … approach.
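Here is the same search as a compilable C++ sketch, assuming an out-of-place R2C/C2R filtering pipeline; try_batch and find_max_batch are illustrative names, not cuFFT API, and it uses a plain binary search rather than the bidirectional walk above:

#include <cufft.h>
#include <cuda_runtime.h>

// Try to create both plans plus the three buffers for a given batch size.
// Frees everything it grabbed and reports whether every allocation succeeded.
static bool try_batch( int nx, int batch )
{
	cufftHandle fwd, inv;
	bool have_fwd = cufftPlan1d( &fwd, nx, CUFFT_R2C, batch ) == CUFFT_SUCCESS;
	bool have_inv = have_fwd && cufftPlan1d( &inv, nx, CUFFT_C2R, batch ) == CUFFT_SUCCESS;

	cufftReal*    data = NULL;
	cufftComplex* fft  = NULL;
	cufftComplex* filt = NULL;
	bool ok = have_inv
		&& cudaMalloc( (void**)&data, sizeof(cufftReal)    * nx         * batch ) == cudaSuccess
		&& cudaMalloc( (void**)&fft,  sizeof(cufftComplex) * (nx/2 + 1) * batch ) == cudaSuccess
		&& cudaMalloc( (void**)&filt, sizeof(cufftComplex) * (nx/2 + 1) )         == cudaSuccess;

	if ( filt )     cudaFree( filt );
	if ( fft )      cudaFree( fft );
	if ( data )     cudaFree( data );
	if ( have_inv ) cufftDestroy( inv );
	if ( have_fwd ) cufftDestroy( fwd );
	cudaGetLastError();  // clear any sticky out-of-memory error before the next attempt
	return ok;
}

// Binary search for the largest batch that fits on the device (0 if none does).
static int find_max_batch( int nx, int total_num_ffts )
{
	int lo = 1, hi = total_num_ffts, best = 0;
	while ( lo <= hi )
	{
		int mid = lo + (hi - lo) / 2;
		if ( try_batch( nx, mid ) ) { best = mid; lo = mid + 1; }
		else                        { hi = mid - 1; }
	}
	return best;
}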

This could have been avoided (and maybe there is another way) if cufftPlan1d/2d/3d/Many were of the form:

cufftReal* data_buff;
cufftComplex* fft_buff;
size_t num_chunks;
// Why is batch an int and not a size_t??? The same could be asked of every function in this lib.
int batch = num_batches_requested;
size_t num_overflow_mod_remainder;

// this would provide guaranteed memory allocation at the time of the request
cufftPlan1d(
	cufftHandle* plan, int nx, cufftType type,
	int& batch,
	size_t& num_chunks,
	size_t& num_overflow_mod_remainder,
	void* data_buff, void* fft_buff,
	ALLOC_MEM
	);

So the requested batch is passed in, but batch is updated with what is actually possible, along with num_chunks (the number of GPU data transfers to perform) and num_overflow_mod_remainder (either zero or the number of batches that must be performed in a last remaining chunk). So the total number of chunks is num_chunks + 1 if there is a remaining uneven chunk.

Or, a planner function could be provided:

cufftPlanPossible(
	cufftHandle* plan, int nx, cufftType type,
	size_t& batch, size_t& num_chunks,
	size_t& num_overflow_mod_remainder
	)
{
	// calculate what is possible and return batch, num_chunks, and num_overflow_mod_remainder to the user
	num_chunks = total_num_ffts / batch;                  // floored integer division
	num_overflow_mod_remainder = total_num_ffts % batch;  // zero, or the number of overflow FFTs
}
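Until something like that exists, the chunking arithmetic can live on the caller's side. A minimal sketch, assuming batch was already found by the search above; total_num_ffts and process_chunk() are illustrative placeholders, not cuFFT API:

#include <cstddef>

// Illustrative placeholder: copy a chunk in, exec forward, multiply by the
// filter, exec inverse, copy the chunk out.
static void process_chunk( size_t first_fft, size_t count )
{
	(void)first_fft; (void)count;  // real work would go here
}

static void run_all( size_t total_num_ffts, size_t batch )
{
	size_t num_chunks = total_num_ffts / batch;  // full chunks
	size_t remainder  = total_num_ffts % batch;  // FFTs left for a final partial chunk

	for ( size_t c = 0; c < num_chunks; ++c )
		process_chunk( c * batch, batch );

	if ( remainder )
		process_chunk( num_chunks * batch, remainder );  // needs its own smaller plan
}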

Performance has its price… paid in the denomination of aspirin tablets.

From the earlier post:

void* data_buff, void* fft_buff

should be

void** data_buff, void** fft_buff

as these could then be set by the proposed function.

Also, from testing, the number of batches per chunk turns out to be 2059 on a Quadro 1700M, which is equal to maxThreadsPerBlock for this processor. So this may not be a memory limitation, but rather a plan limitation in the CUDA FFT library based on the GPU's threads per block? So now it appears there are two limitations: 1) memory, and 2) the max batches that can be processed, based on maxThreadsPerBlock and the FFT library implementation. Of course there is little to no documentation on what batch can be… i.e. NO, NONE, NADA, ZIP, ZIPPO, ZERO, much less any real documentation.


The above was found to be incorrect, due to a bug in how the data structure (cudaDeviceInfo) holding this information was being passed. At the time it did not seem right to me, as the thread dimensions being reported did not look correct.

The actual values are:

maxThreadsPerBlock: 512 (int)
num_batches: 2503 (int)

which means the plan is not limited by threads per block, which is back to making sense. Therefore the limiting factor is again memory which, as stated earlier, is not easy to determine ahead of time.
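For reference, a minimal sketch of reading the value directly from the runtime; the wrong number above came from how the properties struct was being passed around, not from the query itself:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
	// Query device 0's properties straight into a local struct,
	// avoiding any chance of the struct being mangled in transit.
	cudaDeviceProp prop;
	cudaGetDeviceProperties( &prop, 0 );
	printf( "maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock );
	return 0;
}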

Thought I would correct this for those who are interested.

I am reviving this thread because I am facing trouble with cuFFT that looks related to memory management, but I couldn't find much info or help.

Could someone give us some pointers about memory allocation in cuFFT?

Here is what I am facing:

  • I use cufftPlan1d to allocate several plans.
  • I check the result of the calls, which is always CUFFT_SUCCESS.
  • But when executing the FFT, it sometimes crashes with a CUFFT_EXEC_FAILED error (see the sketch below).
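A sketch of a defensive pattern for that failure mode, assuming CUDA 3.2-era APIs; filter_in_place, n, batch, and d_data are illustrative. Synchronizing after the exec makes deferred kernel errors surface at a known point:

#include <cufft.h>
#include <cuda_runtime.h>

static bool filter_in_place( cufftComplex* d_data, int n, int batch )
{
	cufftHandle plan;
	if ( cufftPlan1d( &plan, n, CUFFT_C2C, batch ) != CUFFT_SUCCESS )
		return false;  // a bad plan here can otherwise surface much later

	bool ok = cufftExecC2C( plan, d_data, d_data, CUFFT_FORWARD ) == CUFFT_SUCCESS;

	// cuFFT kernels launch asynchronously; synchronize so a failed launch
	// shows up here instead of in some unrelated later call.
	ok = ok && cudaThreadSynchronize() == cudaSuccess;  // cudaDeviceSynchronize() on newer toolkits

	cufftDestroy( plan );
	return ok;
}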

I tried to work around this by checking the available memory before and after the cufftPlan1d call, and treating the call as failed if no memory was allocated. But in many cases the free memory does not decrease (so no memory seems to have been allocated…) and yet the execution is still fine.

Second problem: in some cases the execution does not crash, but no filtering occurs…

I am using CUDA 3.2 under WinXP 64 and Ubuntu Linux 64. I should point out that it has never crashed under Windows so far…

Hey guys, is cuFFT some kind of taboo?? So many questions on the forum and so few answers…

What size FFT are you trying to do? There is an upper limit, although I'm not sure what it currently is.

As far as plans go, I used a brute-force method to figure out that C2C 1D FFT plans require about 8*FFTSIZE bytes of memory for FFTSIZE > 64K, and 1 MByte for FFTSIZE <= 64K, regardless of batch size. I didn't test FFTSIZEs > 1024K though. If anyone has better information regarding the memory use of FFT plans, I would be very interested.
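That rule of thumb, written out as a helper; the constants are purely a fit to one poster's measurements on one setup, so treat them as assumptions rather than guarantees:

#include <cstddef>

// Empirical plan-size estimate for C2C 1D plans, per the measurements above:
// roughly a flat 1 MByte up to 64K points, ~8 bytes per point beyond that.
static size_t estimate_c2c_plan_bytes( size_t fft_size )
{
	const size_t small_plan_bytes = 1u << 20;        // ~1 MByte for sizes <= 64K
	return fft_size <= (64u << 10) ? small_plan_bytes
	                               : 8 * fft_size;   // ~8*FFTSIZE above 64K
}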

Thank you Charley, it feels good not to be alone.

The signal size is a power of 2, typically 1024 or 2048, sometimes 4096. The batch size varies from a handful to a few hundred. Nothing extreme…