cuDevicePrimaryCtxRetain vs cuCtxCreate

I see this in the official NVIDIA documentation:

From CUDA Runtime API :: CUDA Toolkit Documentation

Note that the use of multiple CUcontexts per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used.

From CUDA Driver API :: CUDA Toolkit Documentation

Note:
In most cases it is recommended to use cuDevicePrimaryCtxRetain.

Under what circumstances would cuCtxCreate have to be used instead of cuDevicePrimaryCtxRetain?

If you wanted to create multiple contexts per process (per device).

To be clear, within a single process I can still use cuDevicePrimaryCtxRetain to obtain one context for each device.

In a single process, under what circumstances would more than one context per device be needed?

I’m not sure I can give an exhaustive answer. One of the aspects of separate contexts is isolation. The address space of one context is isolated from the address space of another context. There might be some situations where that is desirable.

Another aspect of multiple contexts might be called resilience. If I have two contexts and one of them becomes corrupted, the other can still function normally, without a device reset or whatever else the CUDA runtime API would require to restore normal operation.

It might also be useful to have a separate context for a dynamically linked library. In fact, a library might create its own context.

I’m sure I can’t imagine all the cases where multiple contexts might be useful.

To help me better understand the level of isolation involved, compare the following two scenarios (single process):

  1. Two devices and a context each

  2. Single device, two contexts.

What is isolated in 1) that is not isolated in 2)?

I guess what I am asking is whether this isolation is also enforced on the device itself, rather than just on the host.