Is there a maximum number of contexts per GPU encoded into the driver?

As the question says, is there an upper limit to the number of contexts that can be created on a single GPU at once, and is this documented or queryable in some way?

The specific situation I have is that we use GPU-accelerated molecular simulation software which creates either a CUDA or an OpenCL context per simulation, and we often find ourselves running multiple simulations in parallel on the same GPU. We have the problem that after the 20th context, we cannot create any more, and I don’t have access to the full error message, only what the software spits out.

That said, the developer of the simulation software and our own cluster IT folks seem to think there is a hard upper limit, hence the question. Details of hardware/software below:

  • GTX 1080 and GTX 1080 Ti
  • Driver version 384.81
  • CUDA and OpenCL platforms
  • Running in Default (shared) compute mode without MPS (I know Exclusive Process mode is limited to 1 context and MPS to 16 clients)

Any help would be great, but I also know the details may be lacking, so I can try to get more information if requested.

Thanks in advance!

OpenCL is off topic here. You didn’t mention your operating system platform, which is possibly significant to any limits that may exist. Your host system configuration could also be relevant. How many of the GTX 1080 (Ti) are present per host system?

When CUDA context creation fails, what error code is being reported by CUDA? Does the app perform status checks on all CUDA API calls and kernel launches? If not, a failing CUDA API call, including failing context creation, could be a follow-on error to an earlier, undetected failure.
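
For illustration, and purely as a sketch (the CHECK_CUDA macro name is just an example, not anything from the app in question), status-checking every CUDA runtime call typically looks something like this:

    /* Illustrative error-checking wrapper for CUDA runtime API calls. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHECK_CUDA(call)                                               \
        do {                                                               \
            cudaError_t err_ = (call);                                     \
            if (err_ != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error at %s:%d: %s (%s)\n",          \
                        __FILE__, __LINE__,                                \
                        cudaGetErrorName(err_), cudaGetErrorString(err_)); \
                exit(EXIT_FAILURE);                                        \
            }                                                              \
        } while (0)

    int main(void)
    {
        void *p = NULL;
        CHECK_CUDA(cudaMalloc(&p, 1 << 20));   /* every API call gets checked */
        /* kernel launches would be followed by CHECK_CUDA(cudaGetLastError()) */
        CHECK_CUDA(cudaFree(p));
        return 0;
    }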

Have you tried to establish the presence of a hard limit of 20 by creating some simple dedicated test scaffolding that does nothing else but creating CUDA contexts until that fails? If not, I would suggest doing so, as it would provide a useful baseline for further investigation.
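
A minimal sketch of such scaffolding, using the CUDA driver API (file name, the 1024 cap, and the build line are just placeholders), might look like this:

    /* Create CUDA contexts on device 0 until creation fails, then report
     * how many succeeded and which error terminated the loop.
     * Build with e.g.: nvcc -o ctxlimit ctxlimit.c -lcuda
     */
    #include <stdio.h>
    #include <cuda.h>

    int main(void)
    {
        CUresult st = cuInit(0);
        if (st != CUDA_SUCCESS) { printf("cuInit failed: %d\n", (int)st); return 1; }

        CUdevice dev;
        st = cuDeviceGet(&dev, 0);
        if (st != CUDA_SUCCESS) { printf("cuDeviceGet failed: %d\n", (int)st); return 1; }

        enum { MAX_CTX = 1024 };           /* arbitrary upper bound for the test */
        static CUcontext ctx[MAX_CTX];
        int n;
        for (n = 0; n < MAX_CTX; n++) {
            st = cuCtxCreate(&ctx[n], 0, dev);
            if (st != CUDA_SUCCESS) {
                const char *name = NULL;
                cuGetErrorName(st, &name);
                printf("context #%d failed: %s (%d)\n", n + 1, name ? name : "?", (int)st);
                break;
            }
        }
        printf("successfully created %d contexts\n", n);

        for (int i = 0; i < n; i++) cuCtxDestroy(ctx[i]);
        return 0;
    }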

Obviously, each CUDA context requires resources, e.g. memory. The amount of resources required may differ based on the specific features used by each context, GPU architecture, OS, etc. Since any of the resources consumed is finite, we know that the number of CUDA contexts that can be created surely must be finite.

That said, I am not aware of, nor have I ever encountered, a situation where the number of CUDA contexts runs into a fixed upper limit of 20. Hardware limits are commonly of the form 2ⁿ or 2ⁿ-1, so it seems unlikely that you are hitting one of those (e.g. the number of hardware channels provided by the GPU). Even limits imposed by software are usually powers of two (e.g. number of handles). So I suspect something else is going on: maybe a non-obvious out-of-resources condition or a software bug (anywhere in the software stack). As a rudimentary test of the latter hypothesis you could try installing the latest available drivers.

I wonder whether there could be a case of resource leakage here, possibly caused by the abnormal termination of CUDA-accelerated applications. To eliminate this possibility, it would make sense to perform each new experiment on a freshly booted system.

It’s a managed cluster where the GPUs within a node are identical (GTX 1080 or 1080 Ti), and they are requested per job, so any number can be requested. The failure occurs at 20 contexts per GPU, so if 2 GPUs are requested, 40 contexts can be created before an error is thrown.

OS is CentOS 7.3

The error reported by CUDA is “Error initializing context: clCreateContext (-6)”; getting more detail than that will take a bit more time for me to delve into.
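
(For what it’s worth, -6 in the OpenCL headers is CL_OUT_OF_HOST_MEMORY.) An OpenCL counterpart to the context-creation scaffolding suggested above, again just a rough sketch with arbitrary names and caps, could be as simple as:

    /* Rough sketch: create bare OpenCL contexts until clCreateContext fails.
     * Build with e.g.: gcc -o clctx clctx.c -lOpenCL
     */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) return 1;
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) return 1;

        enum { MAX_CTX = 256 };            /* arbitrary cap for the test */
        static cl_context ctx[MAX_CTX];
        int n;
        for (n = 0; n < MAX_CTX; n++) {
            cl_int err = CL_SUCCESS;
            ctx[n] = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
            if (err != CL_SUCCESS) {
                printf("context #%d failed with error %d\n", n + 1, (int)err); /* -6 = CL_OUT_OF_HOST_MEMORY */
                break;
            }
        }
        printf("successfully created %d OpenCL contexts\n", n);

        for (int i = 0; i < n; i++) clReleaseContext(ctx[i]);
        return 0;
    }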

I wonder whether there could be a case of resource leakage here

That would be my concern as well, but there is some other odd behavior I see which indicates either a bug, a resource leak I don’t know how to debug, or a hard limit:

  • Physical memory on the GPU is not exceeded before the 20 context limit is reached. If I look at the nvidia-smi output, only about 1/4 of the memory is used on the GPU
  • The 20-contexts-per-GPU limit appears to be a function of the number of GPUs visible to the process. If 2 GPUs are visible on the host for the job, then 40 contexts can be created. But if 2 GPUs are present on the host and CUDA_VISIBLE_DEVICES=0 so that only the first GPU is accessible, then only 20 contexts can be created.
  • If 2 GPUs are available, 40 contexts can be created, but they will all be on the same GPU (see the sketch below). E.g. if 20 contexts use 1/4 of the physical memory, then after 40 contexts one GPU will have 1/2 of its memory consumed and the other will not have been used at all, yet the error is still thrown.
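
One way to probe the per-process-vs-per-GPU question directly, assuming driver-API scaffolding like the earlier sketch (names and caps are again arbitrary), is to spread context creation round-robin over all visible devices and count how many succeed on each:

    /* Sketch: create contexts round-robin across all visible devices and
     * report per-device counts, to see whether the limit tracks the process
     * or the individual GPU. Build with e.g.: nvcc -o ctxspread ctxspread.c -lcuda
     */
    #include <stdio.h>
    #include <cuda.h>

    int main(void)
    {
        if (cuInit(0) != CUDA_SUCCESS) { printf("cuInit failed\n"); return 1; }

        int ndev = 0;
        cuDeviceGetCount(&ndev);
        if (ndev < 1) { printf("no devices\n"); return 1; }

        enum { MAX_CTX = 1024, MAX_DEV = 16 };
        if (ndev > MAX_DEV) ndev = MAX_DEV;
        static CUcontext ctx[MAX_CTX];
        int created[MAX_DEV] = {0};        /* per-device success count */
        CUresult st = CUDA_SUCCESS;
        int total;
        for (total = 0; total < MAX_CTX; total++) {
            CUdevice dev;
            cuDeviceGet(&dev, total % ndev);
            st = cuCtxCreate(&ctx[total], 0, dev);
            if (st != CUDA_SUCCESS) break;
            created[total % ndev]++;
        }

        printf("total contexts created: %d (stopped on error %d)\n", total, (int)st);
        for (int d = 0; d < ndev; d++)
            printf("  device %d: %d contexts\n", d, created[d]);

        for (int i = 0; i < total; i++) cuCtxDestroy(ctx[i]);
        return 0;
    }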

These sorts of symptoms are what led both the developer and our cluster IT managers to believe it is a hard-coded limit in either the GPU architecture or the driver software, and sent me here.

This might also be the wrong place to work through this, so I’m happy to take this elsewhere more appropriate if need be.

That looks very much like an OpenCL error to me, not a CUDA one (“cl” prefix instead of “cu” or “cuda” prefix). As the name of this forum indicates, it is for CUDA issues. Personally, I have never used OpenCL.

I am very biased, as I worked on CUDA for nine years. With that caveat, I would strongly advise against the use of OpenCL on NVIDIA hardware. It does not seem to have been under active development for the past five years or so.

As stated earlier, instead of working with the full app (presumably large and cumbersome), I would suggest working out the relevant circumstances of this failure with a dedicated minimal test app that focuses just on the context creation portion.

You may want to check whether the OpenCL specification imposes any restrictions on memory used for contexts versus memory available to user applications.

There have been new developments (new, publicly exposed capability) in the NVIDIA OpenCL driver in the last 5 years.

However I personally have a lot less experience with OpenCL compared to CUDA. I could probably make a few comments about CUDA contexts, but it’s nothing that can’t be arrived at by others, and I would hesitate to suggest that any conclusions reached about CUDA context behavior are directly useful in understanding OpenCL context behavior.

Having said all that, the existence of a hardcoded limit in the driver would not surprise me at all; you seem to have made a reasonable effort at “proving” it. Furthermore, I know of no place where NVIDIA publishes such a limit as a specification, and I know of no handles or controls over it, on either the CUDA or OpenCL side of the NVIDIA implementation.

If this limit is an issue, you may want to see if you can work around it by rearchitecting the overall workflow to get what you want done within a limited context footprint.

If you’d like to see a change in NVIDIA CUDA or OpenCL behavior, you’re welcome to file bugs, or bugs marked with RFE in the summary, to describe what you’d like.

http://developer.nvidia.com

You need to be a registered developer to use the bug reporting portal.