While migrating my application from OpenCL to CUDA (driver API), I noticed that the new app works as expected, but its host memory usage is much higher than on the old OpenCL path. So I removed all of the memory-allocation code and ran only the cuDevicePrimaryCtxRetain + cuCtxSetCurrent driver API calls, to give each of my worker threads a context for each GPU in the system. Even then, host memory usage is almost as high as with the full app running, so this seems to be the source of the problem. Using cuCtxCreate instead of retaining the primary context gives similar memory usage - so no difference there.
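A minimal sketch of what I am doing (not the exact code from my app; single-threaded here for brevity, and error handling trimmed):

```c
#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);

    int count = 0;
    cuDeviceGetCount(&count);

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, i);
        /* Retaining the primary context alone already drives host RSS up */
        cuDevicePrimaryCtxRetain(&ctx, dev);
        /* In the real app this happens in a worker thread per GPU */
        cuCtxSetCurrent(ctx);
    }

    getchar(); /* pause here and inspect the process RSS */
    return 0;
}
```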
The difference between running all the cards on OpenCL versus running the AMD cards on OpenCL + the NVIDIA cards on CUDA is almost 100 MB per GPU.
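For reference, I am comparing host memory via the process resident set size, along these lines (with $PID set to the process in question):

```shell
# Resident set size as reported by the kernel
grep VmRSS /proc/$PID/status

# Or just the RSS value in kB via ps
ps -o rss= -p $PID
```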
Any idea why there is such a big memory use by retaining the context?
PS, for the sake of completeness: the driver is 460.39 on Linux, and the tested GPUs were an RTX 2060 and an RTX 3070.