Adjusting CUDA buffer sizes by GPU type

Hi, I have some legacy CUDA code with a few kernels and quite a few
large arrays. In most cases these are shadowed, i.e. the
array exists on both the host and the GPU. This is to ease the organisation of transfers
between the host and the GPU and (in some cases) back again.

As more powerful GPUs have been introduced,
we have been setting the size of some of the buffers from the
amount of free space on the GPU. Mostly this works. But recently the range
of data sizes to be processed by the user has also increased, and sometimes
the buffer size calculation has gone wrong. I guess we should just fix the
bug, but there are many arrays of different sizes, and keeping track of
how much memory we think they will take, compared with how much memory CUDA
thinks is free on the GPU, is getting increasingly complicated (and so error
prone).

Has anyone else experienced this problem?

I was wondering about making our code more resilient, in the sense that instead
of aborting when CUDA says there is no memory left on the card, we should free
everything, halve the buffer size and try again.
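
Roughly what I have in mind is the sketch below (requested_buffer_bytes, minimum_buffer_bytes and free_all_device_buffers() are hypothetical placeholders, not our real code):

size_t buffer_bytes = requested_buffer_bytes;
float *d_buffer = NULL;
cudaError_t err = cudaMalloc((void**)&d_buffer, buffer_bytes);
while(err != cudaSuccess && buffer_bytes > minimum_buffer_bytes) {
  cudaGetLastError();          // clear the error recorded by the failed cudaMalloc
  free_all_device_buffers();   // hypothetical helper: release everything allocated so far
  buffer_bytes /= 2;           // halve the request and try again
  err = cudaMalloc((void**)&d_buffer, buffer_bytes);
}
if(err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }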

Can anyone see problems with this?
Will this be ok on virtual machines running in the cloud?
Will this fragment the GPU’s memory?
Should we also reset the GPU?
Any views on downsides of doing a reset?
(increasingly the GPU is in a remote machine room)

If we have to do the buffer calculation properly, does anyone have any advice
on the best way to organise our code so that maintaining it is not such
a nightmare?

Getting the buffer size right (i.e. as big as the card will support, but no bigger) seems to
give a 10-20% performance boost, so it seems worth doing.

Any help or advice would be most welcome

Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/

I assume you already know how to check both the currently available memory and the total:

cudaError_t err;
size_t free_available_gpu_memory, total_available_gpu_memory;

err = cudaMemGetInfo(&free_available_gpu_memory, &total_available_gpu_memory);
if(err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

As your application proceeds you can compare the dynamic requirements to the available and make corresponding adjustments based on that information.
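
For example, something along these lines (the headroom figure is only an illustrative guess at CUDA's own overhead, not a documented value):

// Size the application buffer from what cudaMemGetInfo() reports as free,
// keeping some headroom for the context/driver (64 MB here is just an example).
size_t headroom_bytes = 64ull*1024*1024;
size_t buffer_max_bytes = (free_available_gpu_memory > headroom_bytes)
                        ? (free_available_gpu_memory - headroom_bytes) : 0;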

Also, if you are using the cuFFT library, it will allocate some temporary memory for intermediate calculations, and cudaMemGetInfo() will reflect those allocations too:

fft_stat=cufftPlanMany(&forward_plan,1,&n,NULL,1,paddedChannels,NULL,1,elem_complex_padd_single,CUFFT_R2C,f_rows_batch);
if(fft_stat!=CUFFT_SUCCESS){printf("CUFFT Error # %d in %s at line %d\n",fft_stat,__FILE__,__LINE__);}

size_t bytes_fft_plan;
fft_stat=cufftGetSize(forward_plan, &bytes_fft_plan);
if(fft_stat!=CUFFT_SUCCESS){printf("CUFFT Error # %d in %s at line %d\n",fft_stat,__FILE__,__LINE__);}

printf("\nTotal device memory required by forward batched fft plan = %zu bytes\n",bytes_fft_plan);

To free that memory you need to invoke cufftDestroy like this:

fft_stat=cufftDestroy(forward_plan);
if(fft_stat!=CUFFT_SUCCESS){printf("CUFFT Error # %d in %s at line %d\n",fft_stat,__FILE__,__LINE__);}
fft_stat=cufftDestroy(inverse_plan);
if(fft_stat!=CUFFT_SUCCESS){printf("CUFFT Error # %d in %s at line %d\n",fft_stat,__FILE__,__LINE__);}

You may already know all of this, but I figured I would mention it just in case. I am not sure what else you are asking about.

Dear CudaaduC,
Thank you. We are not using cuFFT, but the information is appreciated.
Yes, we are using cudaMemGetInfo(), so I guess my question still remains.

Has anyone else hit the problem of sizing buffers in code which
should run, without upsetting the users, on GPUs with different memory
sizes and where the volume of user data varies?

There are multiple buffers with potentially different alignments,
but even if I work out in advance exactly how to set my buffer sizes,
there are potential problems with the variable (and undocumented) space
taken by CUDA and with fragmentation of the GPU’s global memory.

Perhaps an example would help.
If the user has an 8GB board and 6.2GB of data, I would like my
code to run with buffers filling the remaining 1.8GB;
if they have a 25GB Tesla and 3.2GB of data, expanding the buffer
into the remaining 21.8GB gives a small but worthwhile performance
boost.
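
In other words, the calculation I would like to automate is roughly the following (data_bytes and the safety margin are purely illustrative):

// Illustrative sketch: size the buffers from whatever the user's data leaves free.
size_t free_bytes, total_bytes;
cudaMemGetInfo(&free_bytes, &total_bytes);   // queried before the data is transferred

size_t safety_margin = 128ull*1024*1024;     // guess at CUDA's own (undocumented) use
size_t buffer_bytes = 0;
if(free_bytes > data_bytes + safety_margin)
  buffer_bytes = free_bytes - data_bytes - safety_margin;  // e.g. ~1.8GB on an 8GB board with 6.2GB of data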

Does allocate/free risk fragmenting global memory?
Does a GPU device reset risk messing up virtual machines or operation in the cloud?
What is the best way forward?

Many thanks
Bill