How to Enforce Per-Client Memory and SM Limits in CUDA MPS?

I’m trying to enforce per-client resource limits in CUDA MPS but not seeing the expected behavior.

In my Kubernetes Pod spec, I set the following environment variables:

- name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
  value: "40"
- name: CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING
  value: "1"
- name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
  value: "0=40G"
- name: CUDA_MPS_CLIENT_PRIORITY
  value: "0"
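
For reference, my understanding is that the same limits can also be set as server-wide defaults through the control daemon (these are nvidia-cuda-mps-control commands; I haven't gone down this path yet, so take this as a sketch, and I believe the defaults only apply to servers started afterwards):

echo "set_default_active_thread_percentage 40" | nvidia-cuda-mps-control
echo "set_default_device_pinned_mem_limit 0 40G" | nvidia-cuda-mps-control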

However, when I check nvidia-smi pmon, the sm% is still close to 100%. Querying the active thread percentage through the control daemon:

echo "get_active_thread_percentage 7078" | nvidia-cuda-mps-control

returns 100.0. So the limits do not appear to be applied.
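
For completeness, these are the other control-daemon queries I'm aware of for inspecting the server-side limits (a sketch; I'm assuming 7078 above is the MPS server PID reported by get_server_list):

echo "get_server_list" | nvidia-cuda-mps-control
echo "get_default_active_thread_percentage" | nvidia-cuda-mps-control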

What am I missing? Does MPS ignore SM limits set via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE when per-context partitioning is enabled? Is CUDA_MPS_CLIENT_PRIORITY relevant here? How do I ensure each client only uses its intended share of SMs and memory?
