Determine CUDA Context Memory Usage

Is there a way to determine how much GPU memory creating a context will require?

I have a multi-process system where I need to load balance/limit memory across GPUs for tens of processes.
I can determine how much memory “I think” CUDA will use by pre-auditing my buffers with known sizes. I realize this might not be a 1:1 relationship depending on how the GPU allocates memory, but it appeared to be close.

I just moved my code from a GeForce GT 640 to a TITAN X (Pascal) card, and what I thought would consume around 30 MB is taking roughly 181 MB, whereas on the GT 640 it was around 50 MB. It looks like just instantiating a CUDA context takes about 149 MB on the TITAN X. Is there a way to pre-determine this amount? Will it vary across different cards?

Thanks!

Hi,

Memory usage may differ across platforms because of architectural differences.

You can use this API to set resource limits for a CUDA context:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g0651954dfb9788173e60a9af7201e65a
--------------------------------------------------------------------
CUresult cuCtxSetLimit ( CUlimit limit, size_t value )
Set resource limits.

Parameters
  • limit: Limit to set
  • value: Size of limit

Returns
CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_UNSUPPORTED_LIMIT, CUDA_ERROR_OUT_OF_MEMORY, CUDA_ERROR_INVALID_CONTEXT

Description
Setting limit to value is a request by the application to update the current limit maintained by the context. The driver is free to modify the requested value to meet h/w requirements (this could be clamping to minimum or maximum values, rounding up to nearest element size, etc). Note that the CUDA driver will set the limit to the maximum of value and what the kernel function requires. The application can use cuCtxGetLimit() to find out exactly what the limit has been set to.
--------------------------------------------------------------------

Thanks.

Hi, thanks for the reply!
So there isn’t a way to determine at runtime the amount of memory a context will require for a given platform?
We don’t require our customers to use a specific GPU architecture, only that it be NVIDIA. To avoid running out of CUDA memory, we would need to know before we launch a specific task how much GPU memory it will use, and revert to CPU-only processing if it would exceed the limit.

Any more help would be greatly appreciated!

Thanks!
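
One way to act on the CPU-fallback idea above is a small check performed before each task launches. The sketch below is only an illustration: `fitsOnGpu`, its parameters, and the 10% safety margin are assumptions, not part of any CUDA API. The free-memory figure would come from something like cudaMemGetInfo, and the context overhead from a measurement taken on the target GPU.

```cpp
#include <cstddef>

// Hypothetical helper (not from the thread or the CUDA API): decide whether a
// task should run on the GPU or fall back to the CPU.
//   freeBytes      - free device memory, e.g. as reported by cudaMemGetInfo
//   plannedBytes   - sum of the buffer sizes the task intends to allocate
//   contextBytes   - context overhead measured beforehand on this specific GPU
//   safetyFraction - extra headroom as a fraction of plannedBytes (assumed 10%)
constexpr bool fitsOnGpu(std::size_t freeBytes,
                         std::size_t plannedBytes,
                         std::size_t contextBytes,
                         double safetyFraction = 0.10)
{
    return static_cast<double>(plannedBytes) * (1.0 + safetyFraction)
               + static_cast<double>(contextBytes)
           <= static_cast<double>(freeBytes);
}
```

A caller would use the CUDA path when `fitsOnGpu(...)` returns true and the CPU path otherwise. Because allocation granularity and library overhead (NPP, for example) are not modelled here, the margin is a guess that would need tuning per application.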

Hi,

I suppose it is possible to control the memory usage for CUDA.

Could you help profile where the unaccounted-for memory usage comes from?
Any GPU-related task may allocate memory, not only CUDA kernels.

It would be good to find all the sources of consumption first.
You can adapt the profiler from this comment:
https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/post/5168834/#5168834

Thanks.

Hi,
I’m not necessarily trying to control the memory usage, but merely trying to estimate how much memory my tasks will take before I launch them so I don’t run out. I’ve already collected all of the memory allocations up front, but when launched on different architectures and boards my process consumes different amounts of memory, even when processing the same data with the same settings.

I’m just looking to get an order of magnitude of consumption like the following:
If I process data at resolution 1280x720 on board A, it will use 10 MB of memory for all of my buffers and X MB of memory for the CUDA context.

However, as described above, on a GeForce GT 640 board I’m consuming 30 MB, whereas on a TITAN X (Pascal) card I’m using roughly 181 MB for the same data.

(I’m using both nvidia-smi and cudaMemGetInfo to determine the usage.)

Hi,

Sorry for the late reply.

The context size may vary with the kernel implementation.
For example, with its use of shared memory, read/write operations, …

Could you give us a simple source sample of your use case
so we can check it with our internal team for you?

Thanks.

Hi,

Are you looking for source code?

If so, what would help the most? We have lots of CUDA code split throughout our pipeline with allocations for various algorithms, mostly flat arrays of floats or uchar4. We use NPP as well; I’m not sure whether that adds overhead.

Our kernels do utilize some shared memory for some reductions but very little.

Thanks,
Jay

Hi,

Yes.

Could you share a sample whose memory usage differs significantly across platforms?
We want to pass this sample to our internal team for comment.

Thanks.

Here is a simple application that allocates 10 buffers of 1280x720x4 bytes.
The download link below contains the code and images that show the Task Manager settings and the output of nvidia-smi.

Windows reports a 130 MB difference and nvidia-smi reports a 167 MB difference.

https://files.secureserver.net/0fT5be5Q8pDaWR

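#include "stdafx.h"
#include <stdio.h>
#include <conio.h>
#include "cuda_runtime.h"

#pragma comment(lib, "cuda")

int _tmain(int argc, _TCHAR* argv[])
{
  // Select the first GPU.
  cudaError_t error = cudaSetDevice( 0 );
  if( error != cudaSuccess )
  {
    printf( "Unable to set cuda device\n" );
    return 0;
  }

  // Size in bytes of one 1280x720 buffer with 4 channels.
  int nWidth = 1280;
  int nHeight = 720;
  int nChannels = 4;
  int nSize = nWidth*nHeight*nChannels;

  int* pMem[10];

  for( int i = 0; i < 10; ++i )
  {
    error = cudaMalloc( (void**) &pMem[i], nSize );
    if( error != cudaSuccess )
    {
      printf( "Unable to allocate %d bytes (%d)\n", nSize, i );
      return 0;
    }
  }

  // Query free and total device memory, in bytes, after the allocations.
  size_t freeBytes = 0, totalBytes = 0;
  cudaMemGetInfo( &freeBytes, &totalBytes );

  // size_t does not fit in %d on large cards, so print as unsigned long long.
  printf( "Free mem = %llu, Total = %llu\n",
          (unsigned long long) freeBytes, (unsigned long long) totalBytes );

  // Wait for a key press so usage can be inspected externally (e.g. nvidia-smi).
  _getch();

  for( int i = 0; i < 10; ++i )
  {
    error = cudaFree( pMem[i] );
  }

  cudaDeviceReset();

  return 0;
}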
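
For reference, the amount of memory this sample explicitly requests can be worked out by hand. The sketch below just does that arithmetic; `requestedBytes` and `toMiB` are illustrative names, and the result says nothing about allocation granularity or context overhead, which is exactly the part that differs between cards.

```cpp
#include <cstddef>

// Bytes requested by `count` buffers of width * height * channels bytes each.
constexpr std::size_t requestedBytes(std::size_t width, std::size_t height,
                                     std::size_t channels, std::size_t count)
{
    return width * height * channels * count;
}

// Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
constexpr double toMiB(std::size_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// The sample above: 10 buffers of 1280 * 720 * 4 bytes = 36,864,000 bytes.
static_assert(requestedBytes(1280, 720, 4, 10) == 36864000,
              "unexpected total for the sample's buffers");
```

That is roughly 35 MiB of explicit allocations, so anything reported beyond that by nvidia-smi or Task Manager is memory the application did not ask for directly.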

Hi,

Thanks for your sample.
We can reproduce this issue and are checking with our internal team.

We will update you later.

ok

Hi,
Is there any update on this?
Thanks!
jay

Hi,

Here is some information from our internal team.

The amount of pre-allocated memory is related to the number of SMs on the GPU.
A GPU with more SMs requires more memory.

Currently, there is no reliable mechanism to measure it across different GPUs.
You will need to test it directly on the target to get the information.

Thanks.
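
A rough way to turn "test it on the target" into a number is to read free and total memory once, right after the context exists and before any buffers are allocated. The helper below is a sketch under assumptions: `contextOverheadBytes` and `otherUsedBytes` are hypothetical names, the readings are expected to come from cudaMemGetInfo, and the result is only meaningful if usage by other processes and the display is known or negligible, since cudaMemGetInfo reports device-wide figures.

```cpp
#include <cstddef>

// Hypothetical helper: estimate the memory held by a bare CUDA context from a
// single reading taken after the context is created and before any
// cudaMalloc, for example:
//     cudaFree(0);                              // commonly used to force context creation
//     cudaMemGetInfo(&freeBytes, &totalBytes);  // device-wide figures, in bytes
// otherUsedBytes is memory already used by other processes or the display,
// read beforehand with a tool such as nvidia-smi (assumed 0 on an idle GPU).
constexpr std::size_t contextOverheadBytes(std::size_t totalBytes,
                                           std::size_t freeBytes,
                                           std::size_t otherUsedBytes = 0)
{
    // Clamp to zero rather than wrapping around if the readings are inconsistent.
    return (freeBytes + otherUsedBytes >= totalBytes)
               ? 0
               : totalBytes - freeBytes - otherUsedBytes;
}
```

Running this once per GPU model at startup and caching the result would give the per-card constant asked about earlier in the thread; whether it stays constant once kernels and libraries are loaded is something only a test of the real workload can confirm.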

Hi,

Can you recommend a way to test it on a target machine?

Is just initializing a context and allocating a minimal amount of memory enough, or do I need to test different memory footprints on each target (that is, will the memory on each target scale linearly)?

Thanks,
Jay

Hi,

It’s recommended to test the heaviest case of your application.

Thanks.

Hi,

Is there any way to reduce the pre-allocated memory? In my case, the pre-allocated memory is much larger than my model size and consumes most of the memory. For compatibility, I use the Runtime API to allocate and free memory.

Thanks

Yes - follow the first link posted in this thread, or look here for the Runtime API equivalent.