MPS Error Containment - Unable to create new process

As per the Nvidia doc on MPS Error Containment here

I ran processes A and B on GPU 0, and processes C and D on GPU 1, with MPS enabled.

Then I introduced a new error process E on GPU 0 that deliberately causes an illegal memory access. Process E triggers a fatal GPU exception in all the clients on GPU 0 (processes A and B).
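For reference, here is a minimal sketch of what such an error process might look like. The kernel, buffer size, and offset are illustrative, not the exact code I ran; any out-of-bounds device write that produces an illegal memory access would do:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: writes far past the end of its allocation,
// producing an illegal memory access (a fatal, sticky GPU exception).
__global__ void out_of_bounds(int *buf)
{
    buf[threadIdx.x + (1 << 30)] = 42;  // far outside the 256-int buffer
}

int main()
{
    int *d_buf = nullptr;
    cudaMalloc(&d_buf, 256 * sizeof(int));
    out_of_bounds<<<1, 256>>>(d_buf);

    // The illegal access surfaces on the next synchronizing call.
    cudaError_t err = cudaDeviceSynchronize();
    printf("sync returned: %s\n", cudaGetErrorString(err));
    return 0;
}
```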

The clients running on GPU 1 were unaffected and completed successfully.

But the problem is that when I try to create a new process on GPU 1 without killing the affected clients on GPU 0, I get a cudaErrorMpsServerNotReady error.
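In case it helps others reproduce this: because CUDA context creation is lazy, the rejection shows up on the first runtime call that needs a context. A hedged sketch of how the new process on GPU 1 observes the error (the device index and the choice of `cudaFree(0)` as the context-establishing call are my assumptions, not from the original post):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaSetDevice(1);  // the unaffected GPU

    // Context creation is lazy; a no-op call like cudaFree(0) forces it,
    // so this is where the MPS server's refusal becomes visible.
    cudaError_t err = cudaFree(0);
    if (err == cudaErrorMpsServerNotReady) {
        // The single MPS server is still waiting for the faulted
        // clients on GPU 0 to exit before accepting new clients.
        printf("MPS server not ready: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```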

My understanding is that the fatal GPU exception should be contained to GPU 0 and should not affect GPU 1. Yet I am unable to create any new process on GPU 1.

So is it mandatory to kill all the clients hit by the fatal error on GPU 0 before creating any new process on GPU 1?


What GPU are you running on?

Both are NVIDIA A30 GPUs.

Are you running a single MPS server that covers both GPUs?

Yes, I am running a single MPS server. Do I need to run an MPS server per GPU?

The MPS documentation you linked explains that when an error like this occurs, the unaffected clients will continue to operate, but new client requests will be refused until all clients using the MPS server have exited. Please reread the section you linked, specifically this:

The MPS server will wait for client A, client B, and client C to exit, and any new client requests to device 0 and device 1 will be rejected with the error CUDA_ERROR_MPS_SERVER_NOT_READY. After client A, client B, and client C have exited, the server recreates the GPU contexts on device 0 and device 1 and then resumes accepting client requests to device 0 and device 1.

No, I am not suggesting that you run an MPS server per GPU. It's necessary for me to ask these questions in order to give any sort of response, just as I had to ask which GPU you are running on. If you omit important details from your question, then I have to ask you for them.

Based on the example provided in that doc, device 0 and device 1 will reject new client requests. The doc doesn't mention anything about device 2. So my question is: before the affected client processes on device 0 and device 1 exit, will device 2 accept new clients or not?

Note: device 2 is not affected by the fatal exception.

The example given shows that although the exception occurs on device 0, device 1 and any clients using device 1 are also affected. I believe this covers your case: if a client has a particular device "in view" (that is, visible to the CUDA runtime), then that device is affected, and furthermore you are witnessing that behavior.
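As an aside, which devices are "in view" of a client is simply what the CUDA runtime enumerates after CUDA_VISIBLE_DEVICES filtering. A quick sketch (the binary name in the comment is hypothetical):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // The devices "in view" of this client are those the runtime
    // enumerates after CUDA_VISIBLE_DEVICES filtering. E.g. launching
    //   CUDA_VISIBLE_DEVICES=1 ./client
    // reports a single device, so GPU 0 is not in this client's view.
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("devices in view of this client: %d\n", count);
    return 0;
}
```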

So is it mandatory to kill all the clients hit by the fatal error on GPU 0 before creating any new process on GPU 1?

Yes, it seems to be, for your particular usage pattern, and I’ve indicated why I think so. I’m not likely to be able to provide further comments.
