I have been playing with MPS with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20 (say), and while I can verify that the performance of my CUDA application is indeed limited accordingly, the GPU usage reported for the MPS process is still ~100% and the GPU's power draw is ~100% of its limit too.
So even though MPS can limit the CUDA app’s performance, the actual GPU utilisation is not affected? Is the MPS server actually doing energy-consuming “work” on the other 80% of SM threads to keep them available for another process?
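For reference, a rough way I could probe that last question (just a sketch I’m assuming would work - the file name, block count and launch configuration are arbitrary) is to have every block record the hardware %smid register and then count the distinct SMs the kernel actually landed on. My understanding is that the Volta+ MPS limit is implemented by provisioning a subset of SMs to the client, so running this with and without CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=20 should show the distinct-SM count shrinking even while nvidia-smi keeps reporting ~100% utilisation:

```cpp
// smid_probe.cu - count how many distinct SMs a kernel's blocks actually land on.
#include <cstdio>
#include <set>
#include <vector>

// Each block records the hardware SM id it was scheduled on.
__global__ void record_smid(unsigned int *smids)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        smids[blockIdx.x] = smid;
}

int main()
{
    int num_sm = 0;
    cudaDeviceGetAttribute(&num_sm, cudaDevAttrMultiProcessorCount, 0);

    int blocks = num_sm * 8;                // oversubscribe so every allowed SM gets blocks
    unsigned int *d_smids;
    cudaMalloc(&d_smids, blocks * sizeof(unsigned int));

    record_smid<<<blocks, 128>>>(d_smids);
    cudaDeviceSynchronize();

    std::vector<unsigned int> h_smids(blocks);
    cudaMemcpy(h_smids.data(), d_smids, blocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    std::set<unsigned int> distinct(h_smids.begin(), h_smids.end());
    std::printf("device reports %d SMs, blocks landed on %zu distinct SMs\n",
                num_sm, distinct.size());
    cudaFree(d_smids);
    return 0;
}
```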
My test was just using the popular “gpu_burn” stress test, so maybe that is doing something “special”?
I was interested to see if MPS could be used to limit a single heavy process and always leave room for much smaller, latency-sensitive processes to run (which might not even run via MPS, e.g. OpenGL stuff).
The GPU utilization reported by nvidia-smi is basically answering the question “was a CUDA kernel running?”
If the answer is yes, over a “recent” time period, the metric will report 100%. It is not reporting any more detailed or granular information than that.
Given that, if by “utilization” you are referring to what is reported by nvidia-smi, then it is correct that a continuously running kernel that is restricted by MPS to using, say, 20% of the SM/thread resources will show 100% utilization of the GPU.
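You can convince yourself of this with a trivial kernel that keeps exactly one thread on one SM busy - just an illustrative sketch, and the ~20 second spin duration is an arbitrary choice (a display-attached GPU’s watchdog may kill a kernel that runs this long). While it runs, nvidia-smi will still report ~100% GPU utilization:

```cpp
// spin_one_thread.cu - keep a single thread on a single SM busy for ~20 seconds.
#include <cstdio>

// Busy-wait on the SM clock for the requested number of cycles.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    int clock_khz = 0;                              // SM clock rate in kHz
    cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, 0);

    long long cycles = 20LL * clock_khz * 1000;     // roughly 20 seconds of SM cycles
    spin<<<1, 1>>>(cycles);                         // one thread, one block, one SM
    cudaDeviceSynchronize();
    std::printf("kernel finished\n");
    return 0;
}
```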
The nvidia-smi dmon “subsystem” may give you more granular reporting, but I haven’t used it much and don’t have any recipe to suggest. nvidia-smi has command-line help (e.g. nvidia-smi dmon --help), so you could play with it if interested.
Yes, even without MPS, what I observe is that the nvidia-smi dmon “sm” entity seems to report the same info as what I had previously linked. So a continuously running kernel is going to show as 100% even if it only uses 1 thread in 1 block on 1 SM.
With respect to power, my L4 GPU shows 16 W at idle and 30 W when a single kernel is running on a single thread in a single block on a single SM. I do think MPS could affect this observation, because I believe MPS may continuously maintain a context on the GPU(s) it is managing, so this may affect the idle figure, for example. I won’t be able to give a detailed explanation of behavior beyond that. None of it seems surprising to me.
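If you want numbers you can log rather than eyeballing nvidia-smi, the NVML C API exposes the same counters. A minimal polling loop (host-only code, built with your normal compiler or nvcc and linked against -lnvidia-ml; the 60-sample duration and 1-second interval are arbitrary choices on my part) would look roughly like this. Run it in one terminal while the single-thread spin kernel above runs in another, and you should see utilization jump to 100% while power only rises modestly above the idle figure:

```cpp
// power_poll.cu - poll NVML once a second and print GPU utilization and power draw.
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);                // GPU 0

    for (int i = 0; i < 60; ++i) {                      // poll once a second for a minute
        nvmlUtilization_t util;
        unsigned int milliwatts = 0;
        nvmlDeviceGetUtilizationRates(dev, &util);      // same counter nvidia-smi reports
        nvmlDeviceGetPowerUsage(dev, &milliwatts);      // board power draw in mW
        std::printf("gpu util %3u%%  power %.1f W\n", util.gpu, milliwatts / 1000.0);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}
```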
It does seem that nvidia-smi dmon has a “new” layer of metrics referred to as “gpm” metrics, but these are only supported on H100 (i.e. Hopper) and newer GPUs.
Yeah, I think you are right about MPS maintaining a context, so there are no idle periods.
I tried something a bit crazy (because I could): I took two vGPU VMs on the same pGPU (best-effort scheduler) and ran gpu_burn with MPS and a 10% limit on one VM, and gpu_burn as normal (without MPS) on the other VM.
The VM with the 10% limit ran at 10% speed, but the VM with no MPS still ran at 50% speed - the same as if both had been running together without MPS. It did not expand to fill the compute left unused by the MPS-limited VM.
So you can’t really use MPS to limit power draw or free up idle time on the GPU.
Just out of curiosity, if you kill MPS (and otherwise leave that VM idle), is the “other” VM able to get more than 50% speed? (That’s what best-effort seems to suggest - I was just wondering if you could actually witness that.)
Correct. If one VM is idle, the other VM’s CUDA performance is close to 100% (of the pGPU). If both VMs run the same gpu_burn (without MPS), their performance splits about 50/50.
And now with MPS limiting one VM’s threads to 20%, the scheduler still acts as if each VM is 100% busy, so each VM still gets 50% of the pGPU.