Does a sticky CUDA error affect other host processes using the same GPU?

This is a question about the CUDA runtime and error handling.
I get an “illegal memory access” error in my process, and I find that when it occurs, all other processes have trouble using the GPU, so other users on the machine are affected. As far as I know, only the CUDA context in my particular process is corrupted, so how can it affect other people using the GPU? When I kill my process, everything works fine again.
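To make the failure mode concrete, here is a minimal sketch of the kind of “sticky” error I mean (illustrative only, not my actual application; `badKernel` is a stand-in name):

```cpp
// sticky.cu -- illustrative sketch of a "sticky" CUDA error.
// Build with: nvcc sticky.cu -o sticky
#include <cstdio>
#include <cuda_runtime.h>

__global__ void badKernel(int *p) {
    *p = 42;  // p is null -> illegal memory access in device code
}

int main() {
    badKernel<<<1, 1>>>(nullptr);
    cudaError_t err = cudaDeviceSynchronize();
    printf("after sync:   %s\n", cudaGetErrorString(err));  // "an illegal memory access was encountered"

    // The error is sticky: the context is corrupted, so every subsequent
    // runtime call in this process also fails until the process exits.
    void *d = nullptr;
    err = cudaMalloc(&d, 4);
    printf("later malloc: %s\n", cudaGetErrorString(err));
    return 0;
}
```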

If the process that hit the “illegal memory access” error also had a lot of GPU memory reserved, then other processes are going to have trouble using the GPU. Once you kill the process, that memory is released.

Thanks for the reply, but in nvidia-smi I can see there are still several GBs of GPU memory available. Even so, I couldn’t use any CUDA runtime function: cudaGetDeviceCount() itself fails with a CUDA “unknown error”.
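For reference, even a probe as small as this fails from a second process while the faulted one is alive (a minimal sketch, not my exact code):

```cpp
// probe.cu -- run from a separate process while the faulted one is still alive.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        // This is where I see the "unknown error" while the other process holds the GPU.
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d CUDA device(s) visible\n", n);
    return 0;
}
```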

I suppose another possibility is that you have the compute mode on the GPU set to exclusive process, or equivalently that you are using MPS.

Assuming that is not the case, I can’t really explain your observation.

I tried a test case on two different setups, one with a V100 and the other with a GTX 970. I ran an application that performs an illegal operation in device code and then spins forever in a host-side while loop. I verified that the application was still running on that GPU using nvidia-smi, then ran vectorAdd from a separate process on the same GPU. In both cases (CUDA 11.4 on the V100, CUDA 11.7 on the GTX 970) vectorAdd produced normal results (no errors).
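In sketch form, the offending application looked like this (an illustrative reconstruction; `badKernel` is a stand-in name):

```cpp
// spin.cu -- trigger an illegal memory access, then keep the (corrupted)
// context alive by spinning in host code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void badKernel(int *p) { *p = 1; }  // writes through a null pointer

int main() {
    badKernel<<<1, 1>>>(nullptr);
    cudaError_t err = cudaDeviceSynchronize();  // surfaces the sticky error
    printf("error raised: %s -- now spinning\n", cudaGetErrorString(err));
    while (true) { }  // process stays visible in nvidia-smi; run vectorAdd from another shell
    return 0;
}
```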

So I’m not sure what leads to the condition that you observe.

Yes, I am using MPS, and indeed, if I don’t use MPS I can use the GPU normally when the error occurs, without killing that process. I am using MPS but did not set the compute mode to exclusive; it is still the default.
So does this imply that using MPS is effectively equivalent to setting exclusive compute mode, in that only one context is allowed on the GPU, so that when it is corrupted, no process can use the GPU? Sorry, I didn’t have a clear understanding of these concepts.
And thank you very much for answering this question.

Hi, I did some searching and understand it better now. Could you please answer the following question about MPS as well?
On Volta or newer architectures, is it OK to use the default compute mode rather than exclusive mode? I know the latter is recommended, but will the default be slower for my application? I have always used the default mode and have had no problems.
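In case it helps frame the question, this is how I double-check which mode a GPU is actually in (a small sketch using the runtime attribute query):

```cpp
// mode.cu -- query the current compute mode of device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int mode = 0;
    cudaError_t err = cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);
    if (err != cudaSuccess) {
        printf("query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Names follow the cudaComputeMode enum in the runtime API.
    const char *names[] = {"Default", "Exclusive Thread (deprecated)",
                           "Prohibited", "Exclusive Process"};
    printf("compute mode: %s\n", names[mode]);
    return 0;
}
```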

The only information I have here is what is contained in the MPS user guide:

https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_1_2

When using MPS it is recommended to use EXCLUSIVE_PROCESS mode to ensure that only a single MPS server is using the GPU, which provides additional insurance that the MPS server is the single point of arbitration between all CUDA processes for that GPU.

I don’t have any information about performance with or without exclusive mode.
