MPS error containment, fatal GPU exception qualification

Hi forum,

I’m trying to figure out what qualifies as a “fatal GPU exception” as mentioned in the MPS documentation:

  • A fatal GPU exception generated by a Volta MPS client process will be contained within the subset of GPUs shared between all clients with the fatal exception-causing GPU.

I’ve played around with things like

cudaMalloc(&ptr, std::numeric_limits<std::size_t>::max());

But none of them seem to be fatal.

We’re exploring the possibility of running MPS on our k8s nodes and want to understand the boundaries of error isolation, in particular how much error containment we give up by using MPS.

A “fatal” exception here probably refers to a “sticky” or “asynchronous” error as discussed here: basically, any error that occurs as a result of kernel execution. In the non-MPS case, such errors corrupt the CUDA context, and are therefore “fatal” in the sense that they prevent any further execution of CUDA activity in that process.
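To make the “sticky” behavior concrete, here is a minimal sketch for the non-MPS case. The kernel `write_to` and the helpers `trigger_kernel_fault` and `try_small_alloc` are hypothetical names of my own, and writing through a null pointer is just one assumed way to provoke a device-side fault. On a typical setup the synchronize call reports an illegal memory access, and every later CUDA call in the same process keeps failing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes through whatever pointer it is given.
// With an invalid pointer the fault happens on the device, during kernel execution.
__global__ void write_to(int *p) { *p = 42; }

// Launch the faulting kernel and return the error seen at synchronization.
// The launch itself is asynchronous, so the error only surfaces here.
cudaError_t trigger_kernel_fault() {
    write_to<<<1, 1>>>(nullptr);  // assumed trigger: null device pointer
    return cudaDeviceSynchronize();
}

// Attempt an ordinary small allocation and return its status.
cudaError_t try_small_alloc() {
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, 1024);
    if (err == cudaSuccess) cudaFree(p);
    return err;
}

int main() {
    cudaError_t fault = trigger_kernel_fault();
    printf("after kernel:   %s\n", cudaGetErrorString(fault));

    // For a sticky error, reading the last error does not reset the context.
    cudaGetLastError();

    cudaError_t after = try_small_alloc();
    printf("later cudaMalloc: %s\n", cudaGetErrorString(after));
    return 0;
}
```

Running something like this as one MPS client while a second client does normal work on the same GPU should be a reasonable way to observe the containment behavior the documentation describes.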

The above statements referring to “CUDA context” may have to be adapted or modified when we are talking about MPS. I believe the examples you have already tried would instead fall into this category:

CUDA API errors generated on the CPU in the CUDA Runtime or CUDA Driver are delivered only to the calling client.
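For contrast, a minimal sketch of why the oversized `cudaMalloc` is not fatal. The helper `try_alloc` is a hypothetical name of my own. The request is rejected by the API on the host side, no kernel ever runs, and the same process can keep using CUDA afterwards.

```cuda
#include <cstddef>
#include <cstdio>
#include <limits>
#include <cuda_runtime.h>

// Attempt an allocation of the given size and return the API status.
cudaError_t try_alloc(std::size_t bytes) {
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, bytes);
    if (err == cudaSuccess) cudaFree(p);
    return err;
}

int main() {
    // Impossible request: fails in the API call itself, not on the device.
    cudaError_t huge = try_alloc(std::numeric_limits<std::size_t>::max());
    printf("huge alloc:  %s\n", cudaGetErrorString(huge));

    // A non-sticky error is returned once and then reset by cudaGetLastError.
    cudaGetLastError();

    // The context is still usable, so an ordinary allocation should succeed.
    cudaError_t small = try_alloc(1024);
    printf("small alloc: %s\n", cudaGetErrorString(small));
    return 0;
}
```

Under MPS I would expect the same outcome: only the client making the bad call sees the error.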

