MPS thread limit and 100% GPU usage

Hi,

I have been experimenting with MPS using CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20 (for example), and while I can verify that my CUDA application’s performance is indeed limited accordingly, the GPU utilisation reported while MPS is active is still ~100%, and the GPU’s power draw is close to its maximum too.

So even though MPS limits the CUDA app’s performance, the reported GPU utilisation is not affected? Is the MPS server actually doing energy-consuming “work” on the other 80% of SM threads to keep them available for another process?

My test was just using the popular “gpu_burn” stress test, so maybe that is doing something “special”?

I was interested to see whether MPS could be used to limit a single heavy process and always leave room for much smaller, latency-sensitive processes to run (which might not even run via MPS - e.g. OpenGL workloads).
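For anyone reproducing this, a minimal sketch of the setup, assuming the limit is applied via the daemon’s environment (the value 20 and device 0 are just example choices):

```shell
# Start the MPS control daemon with a default SM limit for all clients.
# Assumption: exporting the variable before launching the daemon makes it
# the server-wide default; it can also be set via the control interface.
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20
nvidia-cuda-mps-control -d

# Equivalent, via the control interface after the daemon is up:
echo "set_default_active_thread_percentage 20" | nvidia-cuda-mps-control

# Run the workload as an MPS client, then shut the daemon down:
./gpu_burn 60
echo quit | nvidia-cuda-mps-control
```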

GPU utilization reported by nvidia-smi is basically answering the question “was a CUDA kernel running?”

If the answer is yes, over a “recent” time period, the metric will report 100%. It is not reporting any more detailed or granular information than that.

Given that, if by “utilization” you are referring to what is reported by nvidia-smi, then it is correct that a continuously running kernel that is restricted by MPS to using, say, 20% of the SM/thread resources will show 100% utilization of the GPU.

The nvidia-smi dmon “subsystem” may give you more granular reporting, but I haven’t used it much and don’t have any recipe to suggest. nvidia-smi has command-line help (e.g. nvidia-smi dmon --help), so you could play with it if interested.
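For example (a sketch; per nvidia-smi dmon --help, -s selects metric groups, -d the sampling interval in seconds, and -c the number of samples):

```shell
# Sample power/temperature (p) and utilization (u) once per second, 10 samples.
nvidia-smi dmon -s pu -d 1 -c 10
```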

Okay, thanks. I guess I’m just used to seeing something like 50% utilisation in nvidia-smi correlate with roughly 50% power consumption.

But with MPS I just see 100% utilisation and much more than 20% (the limit) in watts.

For example, the nvidia-smi dmon output for a system with four L40S GPUs running gpu_burn with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20:

# gpu   pwr  gtemp  mtemp    sm   mem   enc   dec   jpg   ofa   mclk   pclk
# Idx     W      C      C     %     %     %     %     %     %    MHz    MHz
    0   222     49      -   100    46     0     0     0     0   9000   2520
    1   234     53      -   100     2     0     0     0     0   9000   2520
    2   233     50      -   100     2     0     0     0     0   9000   2520
    3   240     51      -   100     8     0     0     0     0   9000   2520

With CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50:

# gpu   pwr  gtemp  mtemp    sm   mem   enc   dec   jpg   ofa   mclk   pclk
# Idx     W      C      C     %     %     %     %     %     %    MHz    MHz
    0   342     65      -   100     5     0     0     0     0   9000   2520
    1   354     68      -   100     5     0     0     0     0   9000   2505
    2   346     67      -   100     5     0     0     0     0   9000   2505
    3   342     68      -   100     5     0     0     0     0   9000   2490

And again with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=100:

# gpu   pwr  gtemp  mtemp    sm   mem   enc   dec   jpg   ofa   mclk   pclk
# Idx     W      C      C     %     %     %     %     %     %    MHz    MHz
    0   336     58      -   100     7     0     0     0     0   9000   2070
    1   338     60      -   100     7     0     0     0     0   9000   2055
    2   343     59      -   100     7     0     0     0     0   9000   2070
    3   341     60      -   100     7     0     0     0     0   9000   2025

There is certainly a difference at 20%, but it’s not as large as I was expecting (~20% of full power), and the 50% case looks the same as the 100% case.

Yes. Even without MPS, what I observe is that the nvidia-smi dmon “sm” column reports the same information as the metric I previously described: a continuously running kernel will show 100% even if it uses only one thread in one block on one SM.

With respect to power, my L4 GPU with a single kernel running one thread in one block on one SM draws 16 W at idle and 30 W while the app/kernel is running. I do think MPS could affect this observation, because I believe MPS may continuously maintain a context on the GPU(s) it is managing, so this may affect idle power, for example. I won’t be able to give a more detailed explanation of the behavior than that; none of it seems surprising to me.

It does seem that nvidia-smi dmon has a newer layer of metrics referred to as “gpm” metrics, but these are only supported on H100 (i.e. Hopper) and newer GPUs.

Yeah, I think you are right about MPS maintaining a context, so there are no idle periods.

I tried something a bit crazy (because I could): I took two vGPU VMs on the same pGPU (best-effort scheduling) and ran gpu_burn with MPS and a 10% limit on one VM, and gpu_burn as normal (without MPS) on the other VM.

The VM with the 10% limit ran at 10% speed, but the VM with no MPS still ran at 50% speed - the same as if both had been running together without MPS. It did not expand to use the compute left unused by the MPS-limited VM.

So you can’t really use MPS to reduce power draw or to free up idle compute for other consumers of the GPU.

Just out of curiosity: if you kill MPS (and otherwise leave that VM idle), is the “other” VM then able to get more than 50% speed? (That’s what best-effort seems to suggest; I was just wondering if you could actually witness it.)

Correct. If one VM is idle, the other VM’s CUDA performance is close to 100% (of the pGPU). If both VMs run the same gpu_burn (without MPS), their performance is split about 50/50.

And with MPS on one VM limiting its threads to 20%, the split still behaves the same: each VM acts as if it had the whole GPU and gets 50% of the pGPU.


@daire @Robert_Crovella

How do you verify that MPS limits are being enforced? I set the following env vars but wasn’t able to verify:

- name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
  value: "40"
- name: CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING
  value: "1"
- name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
  value: "0=40G"
- name: CUDA_MPS_CLIENT_PRIORITY
  value: "0"
echo "get_active_thread_percentage 7078" | nvidia-cuda-mps-control
100%
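One thing that may be worth checking (an assumption on my part, based on how the control interface is documented): get_active_thread_percentage takes a *server* PID and reports that server’s default, which setting CUDA_MPS_ACTIVE_THREAD_PERCENTAGE only in a client’s environment does not change. A sketch of setting and querying the server-side default instead:

```shell
# Set the default on the server side rather than only in the client env:
echo "set_default_active_thread_percentage 40" | nvidia-cuda-mps-control

# List running MPS server PIDs, then query one of them
# (make sure the PID passed is a server PID, not a client PID):
echo "get_server_list" | nvidia-cuda-mps-control
```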