CUDA MPS Not Working as Expected in Multi-GPU Environment

Hello everyone,

I’m currently facing an issue with CUDA MPS in a multi-GPU environment. MPS works as expected in a single-GPU setting, but in a multi-GPU environment, all submitted jobs seem to be routed to the first GPU, leaving the remaining GPUs idle while other jobs sit in the queue.

System and Configuration Details

I’m using Slurm 23.11.9. Below are my Slurm and configuration details:

Slurm configuration:

(base) vinil@slurmgpu-scheduler:~$ grep Gres /etc/slurm/slurm.conf
GresTypes=gpu,mps

(base) vinil@slurmgpu-scheduler:~$ grep Gres /etc/slurm/azure.conf
Nodename=slurmgpu-hpc-1 Feature=cloud STATE=CLOUD CPUs=96 ThreadsPerCore=1 RealMemory=875520 Gres=gpu:8,mps:800

Gres configuration:

(base) vinil@slurmgpu-scheduler:~$ cat /etc/slurm/gres.conf
Nodename=slurmgpu-hpc-1 Name=gpu Count=8 File=/dev/nvidia[0-7]
Nodename=slurmgpu-hpc-1 Name=mps Count=800 File=/dev/nvidia[0-7]

Job Script Details

Here’s the job script I’m using:

#!/bin/bash
#SBATCH --job-name=cuda_mps_job
#SBATCH --output=cuda_mps_output.%j
#SBATCH --error=cuda_mps_error.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --gres=mps:25
#SBATCH --time=01:00:00
#SBATCH --partition=hpc

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$SLURM_JOB_ID

mkdir -p $CUDA_MPS_PIPE_DIRECTORY
mkdir -p $CUDA_MPS_LOG_DIRECTORY

if ! pgrep -x "nvidia-cuda-mps-control" > /dev/null; then
    echo "Starting MPS control daemon..."
    nvidia-cuda-mps-control -d
fi

export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

source /shared/home/vinil/anaconda3/etc/profile.d/conda.sh
conda activate training_env
python distributed_training.py

echo "Stopping MPS control daemon..."
echo quit | nvidia-cuda-mps-control
rm -rf $CUDA_MPS_PIPE_DIRECTORY
rm -rf $CUDA_MPS_LOG_DIRECTORY

Issue Details

In my setup, I have configured 800 MPS shares, aiming for 100 MPS shares per GPU. Each job is configured to use 25 MPS shares, which should allow four jobs per GPU (32 jobs total on an 8-GPU node). However, when I submit jobs, only the first GPU is utilized, while the rest are idle, causing other jobs to remain in the queue.
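For reference, the batch is submitted with a simple loop along these lines (the job script filename is just a placeholder for the script shown above):

for i in $(seq 1 32); do
    sbatch cuda_mps_job.sh    # each job requests --gres=mps:25, i.e. a quarter of one GPU's 100 shares
done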

What I’ve Tried

  • Setting CUDA_VISIBLE_DEVICES as described in the NVIDIA MPS documentation.
  • Slurm OPT_MULTIPLE_SHARING_GRES_PJ: set this flag in slurm.conf, as suggested in the Slurm docs, to allow jobs to share multiple GPUs, but it made no difference (see the sketch below this list).
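For reference, the change from the second bullet looked roughly like this in slurm.conf (the flag is spelled without the OPT_ prefix there, if I am reading the SelectTypeParameters documentation correctly; the other values are just what my cluster already uses):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,MULTIPLE_SHARING_GRES_PJ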

Output from squeue shows only jobs assigned to the first GPU, with the remaining jobs queued due to priority/resource limits.

(base) vinil@slurmgpu-scheduler:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
68 hpc cuda_mps vinil CF 0:03 1 slurmgpu-hpc-1
65 hpc cuda_mps vinil CF 0:04 1 slurmgpu-hpc-1
66 hpc cuda_mps vinil CF 0:04 1 slurmgpu-hpc-1
67 hpc cuda_mps vinil CF 0:04 1 slurmgpu-hpc-1
96 hpc cuda_mps vinil PD 0:00 1 (Priority)
95 hpc cuda_mps vinil PD 0:00 1 (Priority)
94 hpc cuda_mps vinil PD 0:00 1 (Priority)
93 hpc cuda_mps vinil PD 0:00 1 (Priority)
92 hpc cuda_mps vinil PD 0:00 1 (Priority)
91 hpc cuda_mps vinil PD 0:00 1 (Priority)
90 hpc cuda_mps vinil PD 0:00 1 (Priority)
89 hpc cuda_mps vinil PD 0:00 1 (Priority)
88 hpc cuda_mps vinil PD 0:00 1 (Priority)
87 hpc cuda_mps vinil PD 0:00 1 (Priority)
86 hpc cuda_mps vinil PD 0:00 1 (Priority)
85 hpc cuda_mps vinil PD 0:00 1 (Priority)
84 hpc cuda_mps vinil PD 0:00 1 (Priority)
83 hpc cuda_mps vinil PD 0:00 1 (Priority)
82 hpc cuda_mps vinil PD 0:00 1 (Priority)
81 hpc cuda_mps vinil PD 0:00 1 (Priority)
80 hpc cuda_mps vinil PD 0:00 1 (Priority)
79 hpc cuda_mps vinil PD 0:00 1 (Priority)
78 hpc cuda_mps vinil PD 0:00 1 (Priority)
77 hpc cuda_mps vinil PD 0:00 1 (Priority)
76 hpc cuda_mps vinil PD 0:00 1 (Priority)
75 hpc cuda_mps vinil PD 0:00 1 (Priority)
74 hpc cuda_mps vinil PD 0:00 1 (Priority)
73 hpc cuda_mps vinil PD 0:00 1 (Priority)
72 hpc cuda_mps vinil PD 0:00 1 (Priority)
71 hpc cuda_mps vinil PD 0:00 1 (Priority)
70 hpc cuda_mps vinil PD 0:00 1 (Priority)
69 hpc cuda_mps vinil PD 0:00 1 (Resources)

nvidia-smi output confirms that only the first GPU is active:

| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000001:00:00.0 Off | 0 |
| N/A 38C P0 85W / 400W | 34066MiB / 40960MiB | 93% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB Off | 00000002:00:00.0 Off | 0 |
| N/A 34C P0 54W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB Off | 00000003:00:00.0 Off | 0 |
| N/A 35C P0 52W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB Off | 00000004:00:00.0 Off | 0 |
| N/A 35C P0 57W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-40GB Off | 0000000B:00:00.0 Off | 0 |
| N/A 35C P0 53W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-40GB Off | 0000000C:00:00.0 Off | 0 |
| N/A 35C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-40GB Off | 0000000D:00:00.0 Off | 0 |
| N/A 35C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-40GB Off | 0000000E:00:00.0 Off | 0 |
| N/A 35C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 19017 M+C python 8480MiB |
| 0 N/A N/A 19018 M+C python 8480MiB |
| 0 N/A N/A 19019 M+C python 8480MiB |
| 0 N/A N/A 19020 M+C python 8480MiB |
| 0 N/A N/A 19045 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 19049 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 19050 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 19051 C nvidia-cuda-mps-server 30MiB |
+-----------------------------------------------------------------------------------------+

Request

Has anyone experienced similar issues or have insights on resolving this? Any help or suggestions would be much appreciated!

Although MPS can be configured to use multiple GPUs, when multiple GPUs are visible to the MPS server/daemon there is no automatic distribution mechanism that routes different jobs to different GPUs. The usage model in this case is still the CUDA multi-GPU usage model.

Are you saying that, with Slurm, we cannot use MPS to run jobs in a multi-GPU environment?

No, I didn’t say anything about Slurm per se. The key message is that MPS by itself does not do automatic work distribution. It will not, for example, assign one single-GPU job to GPU 0 and the next single-GPU job to GPU 1 in round-robin fashion. When multiple GPUs are visible to the MPS server, you are in a multi-GPU environment, and that has ramifications you will need to think through.
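As a minimal sketch only (not a drop-in fix for your setup), the kind of explicit per-job GPU selection that has to happen somewhere would look like this; where the device index comes from, whether the scheduler or a wrapper script, is up to you:

# Hypothetical sketch: something other than MPS has to pick the GPU for this job
GPU_ID=3                                   # placeholder value; MPS will not choose it for you
export CUDA_VISIBLE_DEVICES=$GPU_ID        # restrict this job (and the daemon it starts) to one GPU
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-gpu$GPU_ID
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-gpu$GPU_ID
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
nvidia-cuda-mps-control -d                 # one control daemon per GPU in this sketch
python distributed_training.py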

Thanks. Has anyone used MPS on a Slurm multi-GPU cluster? I can’t find any references online; all the discussions seem to focus on single-GPU setups. Any insights would be appreciated.