I have been playing with MPS with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20 (say), and while I can verify that the performance of my CUDA application is indeed limited accordingly, the GPU usage reported for the MPS process is still ~100% and the GPU's power draw is ~100% of its limit too.
So even though MPS can limit the CUDA app’s performance, the actual GPU utilisation is not affected? Is the MPS server actually doing energy-consuming “work” on the other 80% of SM threads to keep them available for another process?
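For reference, a rough way I could probe that last question (just a sketch I’m assuming would work - the file name, block count and launch configuration are arbitrary) is to have every block record the hardware %smid register and then count the distinct SMs the kernel actually landed on. My understanding is that the Volta+ MPS limit is implemented by provisioning a subset of SMs to the client, so running this with and without CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20 should show the distinct-SM count shrinking even while nvidia-smi keeps reporting ~100% utilisation:

```cpp
// smid_probe.cu - count how many distinct SMs a kernel's blocks actually land on.
#include <cstdio>
#include <set>
#include <vector>

// Each block records the hardware SM id it was scheduled on.
__global__ void record_smid(unsigned int *smids)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        smids[blockIdx.x] = smid;
}

int main()
{
    int num_sm = 0;
    cudaDeviceGetAttribute(&num_sm, cudaDevAttrMultiProcessorCount, 0);

    int blocks = num_sm * 8;                // oversubscribe so every allowed SM gets blocks
    unsigned int *d_smids;
    cudaMalloc(&d_smids, blocks * sizeof(unsigned int));

    record_smid<<<blocks, 128>>>(d_smids);
    cudaDeviceSynchronize();

    std::vector<unsigned int> h_smids(blocks);
    cudaMemcpy(h_smids.data(), d_smids, blocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    std::set<unsigned int> distinct(h_smids.begin(), h_smids.end());
    std::printf("device reports %d SMs, blocks landed on %zu distinct SMs\n",
                num_sm, distinct.size());
    cudaFree(d_smids);
    return 0;
}
```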
My test was just using the popular “gpu_burn” stress test, so maybe that is doing something “special”?
I was interested to see if MPS could be used to limit a single heavy process and always leave room for much smaller, latency-sensitive processes to run (which might not even run via MPS, e.g. OpenGL stuff).
The GPU utilization reported by nvidia-smi is basically answering the question “was a CUDA kernel running?”
If the answer is yes, over a “recent” time period, the metric will report 100%. It is not reporting any more detailed or granular information than that.
Given that, if by “utilization” you are referring to what is reported by nvidia-smi, then it is correct that a continuously running kernel that is restricted by MPS to using, say, 20% of the SM/thread resources will show 100% utilization of the GPU.
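You can convince yourself of this with a trivial kernel that keeps exactly one thread on one SM busy - just an illustrative sketch, and the ~20 second spin duration is an arbitrary choice (a display-attached GPU’s watchdog may kill a kernel that runs this long). While it runs, nvidia-smi will still report ~100% GPU utilization:

```cpp
// spin_one_thread.cu - keep a single thread on a single SM busy for ~20 seconds.
#include <cstdio>

// Busy-wait on the SM clock for the requested number of cycles.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    int clock_khz = 0;                              // SM clock rate in kHz
    cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, 0);

    long long cycles = 20LL * clock_khz * 1000;     // roughly 20 seconds of SM cycles
    spin<<<1, 1>>>(cycles);                         // one thread, one block, one SM
    cudaDeviceSynchronize();
    std::printf("kernel finished\n");
    return 0;
}
```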
The nvidia-smi dmon “subsystem” may give you more granular reporting, but I haven’t used it much and don’t have any recipe to suggest. nvidia-smi has command-line help (e.g. nvidia-smi dmon --help), so you could play with it if interested.
Yes, even without MPS, what I observe is that the nvidia-smi dmon “sm” entity seems to report the same info as what I had previously linked. So a continuously running kernel is going to show as 100% even if it only uses 1 thread in 1 block on 1 SM.
With respect to power, my L4 GPU shows 16 W at idle and 30 W when a single kernel is running on a single thread in a single block on a single SM. I do think MPS could affect this observation, because I believe MPS may continuously maintain a context on the GPU(s) it is managing, so this may affect the idle figure, for example. I won’t be able to give a detailed explanation of behavior beyond that. None of it seems surprising to me.
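If you want numbers you can log rather than eyeballing nvidia-smi, the NVML C API exposes the same counters. A minimal polling loop (host-only code, built with your normal compiler or nvcc and linked against -lnvidia-ml; the 60-sample duration and 1-second interval are arbitrary choices on my part) would look roughly like this. Run it in one terminal while the single-thread spin kernel above runs in another, and you should see utilization jump to 100% while power only rises modestly above the idle figure:

```cpp
// power_poll.cu - poll NVML once a second and print GPU utilization and power draw.
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);                // GPU 0

    for (int i = 0; i < 60; ++i) {                      // poll once a second for a minute
        nvmlUtilization_t util;
        unsigned int milliwatts = 0;
        nvmlDeviceGetUtilizationRates(dev, &util);      // same counter nvidia-smi reports
        nvmlDeviceGetPowerUsage(dev, &milliwatts);      // board power draw in mW
        std::printf("gpu util %3u%%  power %.1f W\n", util.gpu, milliwatts / 1000.0);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}
```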
It does seem that nvidia-smi dmon has a “new” layer of metrics referred to as “gpm” metrics, but these are only supported on H100 (i.e. Hopper) and newer GPUs.
Yeah, I think you are right about MPS maintaining a context, so there are no idle periods.
I tried something a bit crazy (because I could): I took two vGPU VMs on the same pGPU (best-effort scheduler) and ran gpu_burn with MPS and a 10% limit on one VM, and gpu_burn as normal (without MPS) on the other VM.
The VM with the 10% limit ran at 10% speed, but the VM with no MPS still ran at 50% speed - the same as if both had been running together without MPS. It did not expand to fill the compute left unused by the MPS-limited VM.
So you can’t really use MPS to limit power draw or free up idle time on the GPU.
Just out of curiosity, if you kill MPS (and otherwise leave that VM idle), is the “other” VM able to get more than 50% speed? (That’s what best-effort seems to suggest - I was just wondering if you could actually witness that.)
Correct. If one VM is idle, the other VM’s CUDA performance is close to 100% (of the pGPU). If both VMs run the same gpu_burn (without MPS), their performance splits about 50/50.
And now with MPS limiting one VM’s threads to 20%, the scheduler still acts as if each VM is 100% busy, so each VM still gets 50% of the pGPU.