Host memory use of retained primary context

Hey there,

As I am migrating my application from OpenCL to CUDA (driver API), I noticed that my new app works as expected, but its host memory usage is much higher than with the OpenCL path. So I stripped out all of the memory-allocation code and ran only the cuDevicePrimaryCtxRetain + cuCtxSetCurrent driver API calls to give each of my worker threads a context for each GPU in the system. Even then, host memory use is almost as high as with the full app running, so this seems to be the source of the problem. Memory use when calling cuCtxCreate instead of retaining the primary context was similar, so no difference there.
The difference between running all cards on OpenCL and running the AMD cards on OpenCL plus the NVIDIA cards on CUDA is almost 100 MByte of host memory per GPU.
Any idea why retaining the context uses so much host memory?

PS, for the sake of completeness: the driver is 460.39 on Linux; tested GPUs were an RTX 2060 and an RTX 3070.

FWIW, I ran the following code on a GTX 960 on Fedora with CUDA 11.3 (driver 465.19.01). top reported the app using 3.3% of 2GB of host memory, which is about 66MB total.

#include <cuda.h>
#include <unistd.h>

int main() {
    sleep(10);  // window to observe baseline host memory use in top
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_SCHED_YIELD, dev);
    sleep(10);  // window to observe host memory use after context creation
    return 0;
}

I think it's entirely possible that other GPU types use differing amounts of memory.

On a system with 8 V100 GPUs (455.23.05), and 192GB, with a modification to above code to open a context on more than 1 GPU, I observe:

contexts/GPUs      top report
1                   0.1%
4                   0.3%
8                   0.5%

For 8 GPUs the cost per GPU is in the ballpark of 120MB (0.5% of 192GB, divided by 8). Host memory usage does appear to increase roughly in proportion to the number of GPUs a context is created on. I would assume there are various host data areas needed by the CUDA driver API for housekeeping, and this appears to include some one-time overhead as well as some per-device overhead. I don't know what the contents are specifically.