Apart from the 4 bytes holding the integer value itself (likely rounded up to a multiple of the GPU's page size), any running CUDA code also occupies GPU memory for its context.
The context holds various important data structures: the heap for device-side memory allocations, stack space for all of the (tens of thousands of) threads that can potentially run in parallel, the FIFO that buffers device-side printf() output, space to store the entire internal state of the GPU (twice over, by default) in case cudaDeviceSynchronize() is called from device code, and lots of other documented and undocumented things necessary to make CUDA work. Some of these data structures are configurable in size via cudaDeviceSetLimit(), so if you know you are not using a feature you can (somewhat) reduce the amount of memory required.
Is there any way to compute or predict the size of a CUDA context from the driver API, without actually allocating one and measuring its occupancy via NVML?
Checking with nvidia-smi, a context consumes about 305 MB on a V100 and roughly 280 MB on a P100. When many contexts coexist, or several GPU-accelerated applications are loaded on a compute node, the combined size of these contexts becomes significant when estimating how many instances of such applications can fit on the node concurrently. That's why it is important to have a good estimate of this occupation before submitting the jobs.
So I ask again: is there any way or API to know how much memory will be consumed by a context?