MPS Error Isolation

AFAIK, MPS doesn’t provide error isolation. Are there plans to improve that?

This is a complex topic because some of the error isolation and trapping happens at the hardware level, so there’s an interplay between MPS and the error reporting available from the GPU itself. We are exploring ways to get better error isolation through MPS, but they’ll likely come with performance tradeoffs, so they may not always be desirable.

Ultimately, NVIDIA is putting a lot of engineering focus on both MPS and MIG: better error isolation and scheduling control for MPS, and more flexibility for MIG.

