Is there a way to determine how much GPU memory creating a context will require?
I have a multi-process system where I need to load balance/limit memory across GPUs for 10s of processes.
I can determine how much memory “I think” CUDA will use by pre-auditing my buffers with known sizes. I realize this might not be a one-to-one relationship, depending on how the GPU allocates memory, but it appeared to be close.
I just moved my code from a GeForce GT 640 to a TITAN X (Pascal) card, and what I thought would consume around 30 MB is taking roughly 181 MB, whereas on the GT 640 it was around 50 MB. It looks like just instantiating a CUDA context takes about 149 MB on the TITAN X. Is there a way to pre-determine this amount? Will it vary across different cards?
Memory usage may differ across platforms due to architectural differences.
You can check this API for setting limits on a CUDA context:
CUresult cuCtxSetLimit ( CUlimit limit, size_t value )
Set resource limits.
Parameters:
- limit: Limit to set
- value: Size of limit
Returns: CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_UNSUPPORTED_LIMIT, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_INVALID_CONTEXT
Setting limit to value is a request by the application to update the current limit maintained by the context. The driver is free to modify the requested value to meet h/w requirements (this could be clamping to minimum or maximum values, rounding up to nearest element size, etc.). Note that the CUDA driver will set the limit to the maximum of value and what the kernel function requires. The application can use cuCtxGetLimit() to find out exactly what the limit has been set to.
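For illustration, a minimal driver-API sketch of requesting and reading back a limit (assuming device 0; note the driver may clamp the requested value, so the read-back can differ from the request):

```cpp
// Sketch: querying and adjusting a context limit with the CUDA driver API.
#include <cstdio>
#include <cuda.h>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    size_t stackSize = 0;
    cuCtxGetLimit(&stackSize, CU_LIMIT_STACK_SIZE);
    printf("Per-thread stack size: %zu bytes\n", stackSize);

    // Request a smaller per-thread stack; the driver may clamp the value.
    cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 512);
    cuCtxGetLimit(&stackSize, CU_LIMIT_STACK_SIZE);
    printf("Stack size after request: %zu bytes\n", stackSize);

    cuCtxDestroy(ctx);
    return 0;
}
```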
Hi thanks for the reply!
So there isn’t a way to determine at runtime the amount of memory a context will require for a given platform?
We don’t require our customers to use a specific GPU architecture, only that it be NVIDIA. To alleviate the problem of running out of CUDA memory, we would need to know up front, before we launch a specific task, how much GPU memory it will use, and revert to CPU-only processing if it would exceed that limit.
Any more help would be greatly appreciated!
It should be possible to control the memory usage for CUDA.
Could you help profile where the unknown memory usage comes from?
Any GPU-related task may allocate memory, not just CUDA kernels.
It would be good to find all the sources of consumption first.
You can adapt a profiler from this comment:
I’m not necessarily trying to control the memory usage, but merely trying to estimate how much memory my tasks will take before I launch them so I don’t run out. I’ve already tallied all of the memory allocations up front, but when launched on different architectures and boards my process consumes different amounts of memory, even when processing the same data with the same settings.
I’m just looking to get an order of magnitude of consumption like the following:
If I process data at resolution 1280x720 on board A, it will use 10 MB of memory for all of my buffers and X MB of memory for the CUDA context.
However, what I’m seeing, as described above, is that on a GeForce GT 640 board I’m consuming 30 MB, while on a TITAN X (Pascal) card I’m using roughly 181 MB on the same data.
(I’m using both the nvidia-smi and cudaMemGetInfo functions to determine the usage.)
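One practical workaround, a sketch rather than an official mechanism, is to measure the context overhead at startup on each target by comparing NVML's free-memory reading before and after the context is forced into existence (NVML itself does not create a CUDA context). The delta can be skewed if other processes allocate concurrently:

```cpp
// Sketch: estimating CUDA context overhead on the current machine by
// comparing NVML free memory before and after context creation.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlMemory_t before, after;
    nvmlDeviceGetMemoryInfo(dev, &before);   // no CUDA context exists yet

    cudaSetDevice(0);
    cudaFree(0);                             // forces context creation

    nvmlDeviceGetMemoryInfo(dev, &after);
    printf("Approx. context overhead: %llu bytes\n",
           (unsigned long long)(before.free - after.free));

    nvmlShutdown();
    return 0;
}
```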
Sorry for the late reply.
The context size may vary with the kernel implementation.
For example, shared memory, read/write operations, etc.
Could you give us a simple source sample of your use case so we can check it with our internal team for you?
Are you looking for source code?
If so, what would help the most? We have lots of CUDA code spread throughout our pipeline, with allocations for various algorithms, mostly flat arrays of floats or uchar4. We utilize NPP as well; I’m not sure if there is overhead for that.
Our kernels do utilize some shared memory for some reductions but very little.
Could you share a sample whose memory usage differs significantly across platforms?
We want to pass this sample to our internal team for comment.
Here is a simple application that allocates 10 buffers of 1280x720x4:
Here is a download link to the code and images that show the Task Manager readings and the output of nvidia-smi.
Windows reports a 130 MB difference and nvidia-smi reports a 167 MB difference.
#include <cstdio>
#include <cuda_runtime.h>

#pragma comment(lib, "cudart")

int main()
{
    cudaError_t error = cudaSetDevice( 0 );
    if( error != cudaSuccess )
        printf( "Unable to set CUDA device\n" );

    const int nWidth = 1280;
    const int nHeight = 720;
    const int nChannels = 4;
    const int nSize = nWidth*nHeight*nChannels;

    void* pMem[10] = { nullptr };  // buffer array (missing from the original listing)
    for( int i = 0; i < 10; ++i )
    {
        error = cudaMalloc( (void**) &pMem[i], nSize );
        if( error != cudaSuccess )
            printf( "Unable to allocate %d bytes (%d)\n", nSize, i );
    }

    size_t free, total;
    cudaMemGetInfo( &free, &total );
    printf( "Free mem = %zu, Total = %zu\n", free, total );

    for( int i = 0; i < 10; ++i )
        error = cudaFree( pMem[i] );

    return 0;
}
Thanks for your sample.
We can reproduce this issue and are checking with our internal team.
We will update you with more information later.
Is there any update on this?
Here is some information from our internal team.
The amount of pre-allocated memory is related to the number of SMs on the GPU.
A GPU with more SMs requires more memory.
Currently, there is no reliable mechanism to estimate it across different GPUs.
You will need to test directly on the target to get the information.
Can you recommend a way to test it on a target machine?
Is just initializing a context and allocating a minimal amount of memory enough, or do I need to test different memory footprints on each target (meaning, will the memory usage on each target scale linearly)?
It’s recommended to test the heaviest case of your application.
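Following that suggestion, one way to sketch such a probe is to record the free-memory delta around a representative run; `runHeaviestTask()` below is a hypothetical stand-in for the application's real workload:

```cpp
// Sketch: measuring peak device-memory usage of a representative heaviest
// run on the target machine.
#include <cstdio>
#include <cuda_runtime.h>

static void* g_buf = nullptr;

// Hypothetical placeholder: allocate what the heaviest task would allocate.
static void runHeaviestTask() { cudaMalloc( &g_buf, 64 << 20 ); }  // 64 MB

int main()
{
    size_t freeBefore, freeAfter, total;
    cudaSetDevice( 0 );
    cudaFree( 0 );                         // force context creation first
    cudaMemGetInfo( &freeBefore, &total );

    runHeaviestTask();

    cudaMemGetInfo( &freeAfter, &total );
    printf( "Task footprint: %zu bytes\n", freeBefore - freeAfter );

    cudaFree( g_buf );
    return 0;
}
```

A launcher process can then compare this cached footprint against the free value from cudaMemGetInfo() before dispatching a task to the GPU or falling back to CPU.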
Is there any way to reduce the pre-allocated memory? In my case, the pre-allocated memory is much larger than my model size and consumes most of the memory. For compatibility, I use the runtime API to allocate and free memory.
Yes - follow the first link posted in this thread, or look here for the Runtime API equivalent.
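As a sketch, the runtime-API counterparts are cudaDeviceSetLimit()/cudaDeviceGetLimit(); shrinking limits such as the device malloc heap or printf FIFO can trim some of the context's pre-allocated memory (the driver may clamp requested values, so read the limit back to see what was actually set):

```cpp
// Sketch: tuning context limits via the CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t v = 0;

    // Request a 1 MB device-side printf buffer.
    cudaDeviceSetLimit( cudaLimitPrintfFifoSize, 1 << 20 );
    cudaDeviceGetLimit( &v, cudaLimitPrintfFifoSize );
    printf( "printf FIFO size: %zu bytes\n", v );

    // Request a 1 MB in-kernel malloc heap.
    cudaDeviceSetLimit( cudaLimitMallocHeapSize, 1 << 20 );
    cudaDeviceGetLimit( &v, cudaLimitMallocHeapSize );
    printf( "malloc heap size: %zu bytes\n", v );

    return 0;
}
```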