cuDevicePrimaryCtxRetain returns CUDA_ERROR_OUT_OF_MEMORY

Hi,

I have a system with 12 RTX 3060 cards, each with 12 GB of memory.

When I call cuDevicePrimaryCtxRetain for each of them, the last 2 GPUs fail with CUDA_ERROR_OUT_OF_MEMORY. The host has 4 GB of RAM. Is there a way to reduce the memory required for the context?
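For reference, the failing pattern is essentially this (a minimal sketch of the driver-API calls involved; my actual code has more error handling):

```c
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUresult rc = cuInit(0);
    if (rc != CUDA_SUCCESS) { fprintf(stderr, "cuInit failed: %d\n", rc); return 1; }

    int count = 0;
    cuDeviceGetCount(&count);  /* reports 12 on this system */

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, i);
        /* Fails with CUDA_ERROR_OUT_OF_MEMORY for the last 2 devices */
        rc = cuDevicePrimaryCtxRetain(&ctx, dev);
        if (rc != CUDA_SUCCESS) {
            const char *name = NULL;
            cuGetErrorName(rc, &name);
            fprintf(stderr, "device %d: %s\n", i, name ? name : "unknown error");
        }
    }
    return 0;
}
```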

Thanks,
Daniel

Is it really 4 GB? In my experience, a present-day system can get by with 8 GB of system memory for light desktop duty, and 32 GB for a workstation-class environment doing some heavy computational lifting (possibly involving GPUs). How much memory is available to user processes after booting the operating system? What is the operating system?

According to conventional wisdom, a well-balanced GPU-accelerated system should have more system memory than the total memory of all GPUs combined, ideally 2x to 4x that amount depending on the use case. What is the use case envisioned for this 12-GPU system?

Hi, yes, it is 4 GB, on Windows 11. I develop a trading system for fast-moving markets such as futures and forex. With 8 GB it seems to work fine. I was just wondering whether there are options to reduce the memory consumption of the CUDA/driver interface.

Thanks - will go with 8GB.

I would look at it from the opportunity cost perspective: The cost difference between equipping the system with 8 GB instead of 4 GB is minimal to non-existent (when viewed across multiple DRAM suppliers), and the value of any time spent trying to squeeze the CUDA software stack for 12 GPUs into the smaller system memory is likely to exceed that cost.

To my knowledge, there are no software configuration knobs that reduce the memory requirements of the CUDA software stack. These requirements are quite modest and easily met by any reasonably configured system. I will note that I would not consider a system with 8 GB of system memory and 12 GPUs to be reasonably configured.

Under-provisioning of system memory is a fairly common performance-reducing issue in GPU-accelerated systems. 12 GPUs can chew through a lot of data, and that data needs to be shuffled in and out of the GPUs via system memory (given that you are using consumer parts under Windows, I take it as a given that you are not using GPUDirect), even if system memory just serves as a buffer for data coming in from disk and the network.
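To make the buffering point concrete, a typical staging path looks something like the sketch below (illustrative only; the function name and the pinned-buffer pattern are my assumptions, not anything from your code). Every byte that reaches a GPU transits a host-side buffer first, and with 12 GPUs each needing its own staging buffers, the demand on system RAM multiplies.

```c
#include <cuda.h>
#include <stddef.h>

/* Illustrative staging path: disk/network -> pinned host buffer -> GPU. */
int stage_to_device(CUdeviceptr d_buf, size_t bytes, CUstream stream)
{
    void *h_buf = NULL;
    /* Pinned (page-locked) host memory is needed for async copies and
     * counts directly against system RAM. */
    CUresult rc = cuMemHostAlloc(&h_buf, bytes, 0);
    if (rc != CUDA_SUCCESS) return -1;

    /* ... fill h_buf from disk or the network feed ... */

    rc = cuMemcpyHtoDAsync(d_buf, h_buf, bytes, stream);
    if (rc != CUDA_SUCCESS) { cuMemFreeHost(h_buf); return -1; }

    cuStreamSynchronize(stream);  /* wait before reusing or freeing the buffer */
    cuMemFreeHost(h_buf);
    return 0;
}
```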

Now, conventional wisdom and common observations may not apply to a specific (and possibly very unusual) use case. But the relationship between system memory size and the performance of this CUDA-accelerated application is something you might want to keep an eye on.