Too much global memory occupied

the code is as follows:


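#include <cuda_runtime.h>
int main() {
	int * a;
	cudaMalloc(&a, sizeof(int));	// the 4-byte allocation in global memory
	return 0;
}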

I only allocate 4 bytes in global memory, but when I check GPU usage with the “nvidia-smi” command, 174 MB of memory is in use. I can’t figure out why…

Apart from the 4 bytes for holding an integer value, probably rounded up to a multiple of the GPU’s page size, any running CUDA code also occupies GPU memory for a context.

The context holds various important data structures like the heap for device-side memory allocations, stack space for all (tens of thousands of) threads that can potentially run in parallel, the FIFO that buffers device-side printf() output, space to store the entire internal state of the GPU twice over (by default) in case cudaDeviceSynchronize() is called on the device, and lots of other documented and undocumented stuff necessary to make CUDA work. Some of these data structures are configurable in size by calls to cudaDeviceSetLimit(), so if you know you are not using a feature you can (somewhat) reduce the amount of memory required.
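As a rough illustration of the cudaDeviceSetLimit() point, here is a minimal sketch that requests smaller sizes for three of the configurable structures and reads back what the driver actually granted. The specific sizes (1 MiB heap, 64 KiB printf FIFO, 512-byte stacks) and the helper name print_limit are arbitrary choices for this example, not recommendations. The driver is free to round or clamp the requested values, and how much memory this actually saves will vary with GPU and CUDA version.

#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper: print the current value of one context limit,
// tolerating limits that the device or driver does not support.
static void print_limit(const char *name, enum cudaLimit limit) {
  size_t value = 0;
  cudaError_t err = cudaDeviceGetLimit(&value, limit);
  if (err == cudaSuccess)
    printf("%-26s %zu bytes\n", name, value);
  else
    printf("%-26s unavailable (%s)\n", name, cudaGetErrorString(err));
}

int main() {
  // Ask for smaller limits before launching any kernel. These are
  // requests only; return codes are ignored here and the values are
  // read back below instead.
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, (size_t)1 << 20);   // device-side malloc heap
  cudaDeviceSetLimit(cudaLimitPrintfFifoSize, (size_t)64 << 10);  // device-side printf buffer
  cudaDeviceSetLimit(cudaLimitStackSize, 512);                    // per-thread stack

  print_limit("cudaLimitMallocHeapSize", cudaLimitMallocHeapSize);
  print_limit("cudaLimitPrintfFifoSize", cudaLimitPrintfFifoSize);
  print_limit("cudaLimitStackSize", cudaLimitStackSize);
  return 0;
}

Build with something like "nvcc limits.cu -o limits" (the file name is just an example) and compare the nvidia-smi reading with and without the cudaDeviceSetLimit() calls.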

Thank you !!!

Is there any way to compute or predict the size of a CUDA context from the driver API, without allocating one and measuring its memory usage with NVML?

There is an occupancy calculator spreadsheet available in the \Program Files\Nvidia GPU Computing Toolkit\CUDA\v10.1\tools directory. That might help.

The CUDA Occupancy calculator spreadsheet computes the multiprocessor occupancy of a given CUDA kernel.

What I am talking about is the amount of memory occupied by a single CUDA context. Take for example this simple code:

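#include <cuda.h>
#include <stdio.h>
int main() {
  CUcontext ctx;
  CUdevice device = 0;

  cuInit(0);        // must precede any other driver API call
  cuDeviceGet(&device, 0);
  cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, device);

  getchar();        // keep the process alive while checking nvidia-smi
  cuCtxDestroy(ctx);
  return 0;
}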

If you check this code with nvidia-smi on a V100, it consumes 305 MB of memory. On a P100 it takes approximately 280 MB. When dealing with many contexts or loading different GPU-accelerated applications on a compute node, the sum of these contexts can become important in estimating how many instances of such applications can fit onto the compute node concurrently. That’s why it is important to have a good estimate of this memory usage before submitting the jobs.

So I ask again: is there any way or API to know how much memory will be consumed by a context?

There is no API.

You were given a basically empirical proposal in the comments here:

Such empirical methods could change with CUDA version, GPU type, or the phase of the moon.
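For reference, one empirical approach (not necessarily the one proposed in those comments) is to read the device's used memory through NVML immediately before and after creating a context. The sketch below does that. It assumes the three-argument cuCtxCreate() used earlier in this thread, which newer toolkits may have changed, and the helper name used_bytes is made up for this example. The NVML reading covers all processes on the GPU, so the difference is only meaningful on an otherwise idle device.

#include <cuda.h>
#include <nvml.h>
#include <stdio.h>

// Hypothetical helper: device memory currently in use (all processes),
// in bytes, as reported by NVML. Returns 0 if the query fails.
static unsigned long long used_bytes(nvmlDevice_t dev) {
  nvmlMemory_t mem;
  if (nvmlDeviceGetMemoryInfo(dev, &mem) != NVML_SUCCESS) return 0;
  return mem.used;
}

int main() {
  CUdevice cu_dev;
  CUcontext ctx;
  nvmlDevice_t nvml_dev;
  char bus_id[32];

  if (cuInit(0) != CUDA_SUCCESS || nvmlInit() != NVML_SUCCESS) {
    fprintf(stderr, "CUDA or NVML initialization failed\n");
    return 1;
  }

  // CUDA and NVML may enumerate devices in different orders, so match
  // the NVML handle to CUDA device 0 through its PCI bus id.
  if (cuDeviceGet(&cu_dev, 0) != CUDA_SUCCESS ||
      cuDeviceGetPCIBusId(bus_id, (int)sizeof bus_id, cu_dev) != CUDA_SUCCESS ||
      nvmlDeviceGetHandleByPciBusId(bus_id, &nvml_dev) != NVML_SUCCESS) {
    fprintf(stderr, "could not locate device 0\n");
    nvmlShutdown();
    return 1;
  }

  unsigned long long before = used_bytes(nvml_dev);
  if (cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, cu_dev) != CUDA_SUCCESS) {
    fprintf(stderr, "cuCtxCreate failed\n");
    nvmlShutdown();
    return 1;
  }
  unsigned long long after = used_bytes(nvml_dev);

  // Computed in floating point so that a decrease (another process
  // freeing memory in between) shows up as a negative number.
  printf("context footprint: about %.1f MiB\n",
         ((double)after - (double)before) / (1024.0 * 1024.0));

  cuCtxDestroy(ctx);
  nvmlShutdown();
  return 0;
}

This links against both the driver library and NVML (for example with -lcuda -lnvidia-ml). The number it prints is a measurement for one GPU, driver, and CUDA version, with the caveats already given above.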