Does a sticky CUDA error affect other host processes using the same GPU?

This is a question about the CUDA runtime and error handling.
I get an “illegal memory access” error in my process, and I find that when it occurs, all other processes have trouble using the GPU, so other users on the machine are affected. As far as I know, only the CUDA context in my particular process is corrupted, so how can it affect other people using the GPU? When I kill my process, everything works fine again.
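To make the failure mode concrete, here is a minimal sketch of the kind of “sticky” error I mean (illustrative only, not my actual application; `badKernel` is a stand-in name):

```cpp
// sticky.cu -- illustrative sketch of a "sticky" CUDA error.
// Build with: nvcc sticky.cu -o sticky
#include <cstdio>
#include <cuda_runtime.h>

__global__ void badKernel(int *p) {
    *p = 42;  // p is null -> illegal memory access in device code
}

int main() {
    badKernel<<<1, 1>>>(nullptr);
    cudaError_t err = cudaDeviceSynchronize();
    printf("after sync:   %s\n", cudaGetErrorString(err));  // "an illegal memory access was encountered"

    // The error is sticky: the context is corrupted, so every subsequent
    // runtime call in this process also fails until the process exits.
    void *d = nullptr;
    err = cudaMalloc(&d, 4);
    printf("later malloc: %s\n", cudaGetErrorString(err));
    return 0;
}
```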

If the process that hit the “illegal memory access” error also had a lot of GPU memory reserved, then other processes are going to have trouble using the GPU. Once you kill the process, that memory is released.

Thanks for the reply, but in nvidia-smi I can see there are still several GBs of GPU memory available. Even so, I couldn’t use any CUDA runtime function: cudaGetDeviceCount() itself fails with a CUDA “unknown error”.
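For reference, even a probe as small as this fails from a second process while the faulted one is alive (a minimal sketch, not my exact code):

```cpp
// probe.cu -- run from a separate process while the faulted one is still alive.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        // This is where I see the "unknown error" while the other process holds the GPU.
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d CUDA device(s) visible\n", n);
    return 0;
}
```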

I suppose another possibility is that you have the compute mode on the GPU set to exclusive process, or equivalently that you are using MPS.

Assuming that is not the case, I can’t really explain your observation.

I tried a test case on two different setups, one with a V100 and the other with a GTX 970. I ran an application that performs an illegal operation in device code and then spins forever in a host-side while loop. I verified that the application was still running on that GPU using nvidia-smi, then ran vectorAdd from a separate process on the same GPU. In both cases (CUDA 11.4 on the V100, CUDA 11.7 on the GTX 970) vectorAdd produced normal results (no errors).
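In sketch form, the offending application looked like this (an illustrative reconstruction; `badKernel` is a stand-in name):

```cpp
// spin.cu -- trigger an illegal memory access, then keep the (corrupted)
// context alive by spinning in host code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void badKernel(int *p) { *p = 1; }  // writes through a null pointer

int main() {
    badKernel<<<1, 1>>>(nullptr);
    cudaError_t err = cudaDeviceSynchronize();  // surfaces the sticky error
    printf("error raised: %s -- now spinning\n", cudaGetErrorString(err));
    while (true) { }  // process stays visible in nvidia-smi; run vectorAdd from another shell
    return 0;
}
```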

So I’m not sure what leads to the condition that you observe.

Yes, I am using MPS, and indeed, if I don’t use MPS I can use the GPU normally when the error occurs, without killing that process. I am using MPS but did not set the compute mode to exclusive; it is still the default.
So does this imply that using MPS is effectively equivalent to setting exclusive compute mode, in that only one context is allowed on the GPU, so that when it is corrupted, no process can use the GPU? Sorry, I didn’t have a clear understanding of these concepts.
And thank you very much for answering this question.

Hi, I did some searching and understand it better now. Could you please answer the following question about MPS as well?
On Volta or newer architectures, is it OK to use the default compute mode rather than exclusive mode? I know the latter is recommended, but will the default be slower for my application? I have always used the default mode and have had no problems.
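In case it helps frame the question, this is how I double-check which mode a GPU is actually in (a small sketch using the runtime attribute query):

```cpp
// mode.cu -- query the current compute mode of device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int mode = 0;
    cudaError_t err = cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);
    if (err != cudaSuccess) {
        printf("query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Names follow the cudaComputeMode enum in the runtime API.
    const char *names[] = {"Default", "Exclusive Thread (deprecated)",
                           "Prohibited", "Exclusive Process"};
    printf("compute mode: %s\n", names[mode]);
    return 0;
}
```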

The only information I have here is what is contained in the MPS user guide:

https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_1_2

When using MPS it is recommended to use EXCLUSIVE_PROCESS mode to ensure that only a single MPS server is using the GPU, which provides additional insurance that the MPS server is the single point of arbitration between all CUDA processes for that GPU.

I don’t have any information about performance with or without exclusive mode.
