I’m trying to enforce per-client SM and memory limits with CUDA MPS, but I’m not seeing the expected behavior.
In my Kubernetes Pod spec, I set the following environment variables (a trimmed Pod spec sketch follows the list):
- name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
  value: "40"
- name: CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING
  value: "1"
- name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
  value: "0=40G"
- name: CUDA_MPS_CLIENT_PRIORITY
  value: "0"
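For context, these variables are set on the workload container itself; a trimmed sketch of where they sit in the Pod spec is below (the pod, container, and image names are placeholders, not my real manifest):

apiVersion: v1
kind: Pod
metadata:
  name: mps-client              # placeholder name
spec:
  containers:
  - name: worker                # placeholder container name
    image: my-cuda-app:latest   # placeholder image
    env:
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
      value: "40"
    # ...plus the other three CUDA_MPS_* variables listed above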
However, when I check nvidia-smi pmon, sm% for the client process is still close to 100%. Even when I query the limit through the control daemon:
echo "get_active_thread_percentage 7078" | nvidia-cuda-mps-control
it returns 100.0, so the limits do not appear to be applied.
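For completeness, this is roughly how the MPS daemon is started and how I watch utilization; the pipe and log directories are just the ones I happen to use, not requirements:

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps   # clients must see the same path
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d                       # start the MPS control daemon
nvidia-smi pmon -i 0                             # per-process sm% on GPU 0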
What am I missing? Does MPS ignore SM limits set via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE when per-context partitioning (CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING) is enabled? Is CUDA_MPS_CLIENT_PRIORITY relevant here at all? And how do I ensure that each client uses only the intended share of SMs and device memory?