So nvidia-smi reveals 3 compute modes:
0: Normal mode
1: Compute-exclusive mode (only one compute program per GPU allowed)
2: Compute-prohibited mode (no compute programs may run on this GPU)
This is the understanding I have of the first 2 modes:
Normal mode (sometimes referred to as “shared mode”) is the default. It serially executes kernels from different threads in the order they were queued, so access is blocked for the duration of each GPU kernel execution. If multiple GPUs are available but none is explicitly specified, gpu0 is targeted by default.
Compute-exclusive mode locks ownership of a GPU to a single host thread, so no two threads can target the same GPU simultaneously. Access is blocked for as long as the owning host thread is executing. If multiple GPUs are available but none is explicitly specified, the second thread is assigned to the second GPU, and so on. I don’t know what the expected fallback behavior is when a specific GPU is targeted. I suppose that’s my first question.
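To make my understanding of the first two modes concrete, here is a toy Python model of implicit device selection. It is not CUDA code and does not reflect the actual driver implementation; the function name `pick_device` and the `owned` list are made up for illustration, and the behavior encoded is only my reading described above.

```python
from typing import List, Optional

# Compute modes as numbered in the nvidia-smi listing above.
NORMAL, EXCLUSIVE, PROHIBITED = 0, 1, 2


def pick_device(modes: List[int], owned: List[bool]) -> Optional[int]:
    """Toy model (assumption, not driver code) of implicit device selection.

    modes[i] is the compute mode of GPU i.
    owned[i] is True if some host thread already holds GPU i.
    Returns the index of the GPU a new thread would get, or None.
    """
    for i, mode in enumerate(modes):
        if mode == PROHIBITED:
            continue  # no compute programs may run on this GPU
        if mode == EXCLUSIVE and owned[i]:
            continue  # already claimed by another thread, try the next GPU
        # Normal mode: take the GPU even if it is in use (kernels queue up).
        return i
    return None


# Two GPUs in normal mode: every thread lands on gpu0.
print(pick_device([NORMAL, NORMAL], [True, False]))
# Two GPUs in exclusive mode: the second thread gets the second GPU.
print(pick_device([EXCLUSIVE, EXCLUSIVE], [True, False]))
```

What this model cannot answer is my question above: what happens in exclusive mode when the thread explicitly asks for a GPU that is already owned.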
So if my understanding above is correct, the obvious question is: why not have a first-available device fallback feature for mode 0 as well? In a multi-GPU system, it would seem a more appropriate default than “Normal” (shared) mode too.
My final suggestion is about naming: referring to mode 0 as “shared mode” is misleading, since the GPU isn’t actually shared simultaneously (kernels block each other), and performance can be impacted. It would be more accurate IMO to refer to this as “oversubscribed mode”, since multiple threads can be blocked waiting on GPU kernels to execute. It might also be helpful to include the fallback feature in the name of mode 1.
I’d like to see something like this instead:
0: Oversubscribed-fallback mode
1: Exclusive-fallback mode (only one compute program per GPU allowed)
2: Prohibited mode (no compute programs may run on this GPU)
So the only change beyond the names is that mode 0 supports fallback to the first available GPU instead of remaining hung up on one specific device. Non-fallback modes could be added or kept if there is some reason I’m not aware of, but I can’t think of any offhand. The only other control measure I’d like a handle on would be some method of device ACLs for shared systems. But it’s probably best to address that in another post.
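Here is a rough Python sketch of the selection policy I have in mind for the renamed modes. It is purely hypothetical: `pick_device_proposed`, the `busy` list, and the `preferred` argument are names I invented, and the choice to queue on a mode 0 device when nothing is idle is my own assumption about how the "oversubscribed" part would work.

```python
from typing import List, Optional

# Proposed mode names from the list above (hypothetical, not real nvidia-smi modes).
OVERSUBSCRIBED_FALLBACK, EXCLUSIVE_FALLBACK, PROHIBITED = 0, 1, 2


def pick_device_proposed(modes: List[int], busy: List[bool],
                         preferred: int = 0) -> Optional[int]:
    """Sketch of the proposed policy (assumption, not an existing feature).

    modes[i] is the proposed mode of GPU i, busy[i] is True if GPU i is in use.
    Prefer an idle GPU, starting the search at `preferred`; only if none is
    idle, queue on a mode 0 GPU. Returns a GPU index, or None if nothing fits.
    """
    usable = [i for i, m in enumerate(modes) if m != PROHIBITED]
    # Search order: the preferred device first, then wrap around the rest.
    order = sorted(usable, key=lambda i: (i - preferred) % len(modes))

    # First pass: fall back to the first idle GPU, whatever its mode.
    for i in order:
        if not busy[i]:
            return i

    # Second pass: nothing is idle, so only mode 0 GPUs may be oversubscribed.
    for i in order:
        if modes[i] == OVERSUBSCRIBED_FALLBACK:
            return i
    return None


# gpu0 is busy, gpu1 is idle: mode 0 now falls back to gpu1 instead of waiting.
print(pick_device_proposed([0, 0], busy=[True, False]))
```

With both GPUs busy, the same call would return the preferred device and the work would queue there, which is the only case where the "oversubscribed" name really applies.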
Please let me know if my initial understandings are incorrect or if this makes sense.