Hello everyone,
I’m currently facing an issue with CUDA MPS in a multi-GPU environment. MPS works as expected in a single-GPU setting, but in a multi-GPU environment, all submitted jobs seem to be routed to the first GPU, leaving the remaining GPUs idle while other jobs sit in the queue.
System and Configuration Details
I’m using Slurm 23.11.9. Below are the relevant Slurm and GRES configuration details:
Slurm configuration:
(base) vinil@slurmgpu-scheduler:~$ grep Gres /etc/slurm/slurm.conf
GresTypes=gpu,mps
(base) vinil@slurmgpu-scheduler:~$ grep Gres /etc/slurm/azure.conf
Nodename=slurmgpu-hpc-1 Feature=cloud STATE=CLOUD CPUs=96 ThreadsPerCore=1 RealMemory=875520 Gres=gpu:8,mps:800
Gres configuration:
(base) vinil@slurmgpu-scheduler:~$ cat /etc/slurm/gres.conf
Nodename=slurmgpu-hpc-1 Name=gpu Count=8 File=/dev/nvidia[0-7]
Nodename=slurmgpu-hpc-1 Name=mps Count=800 File=/dev/nvidia[0-7]
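If I read the Slurm MPS guide correctly, the Count=800 on that single mps line should be split evenly across the listed device files (100 per GPU), and the same split can also be written out explicitly with one line per device. Purely as an untested sketch of the per-device form I have in mind (I am not sure whether the two forms are scheduled identically):
Nodename=slurmgpu-hpc-1 Name=gpu Count=8 File=/dev/nvidia[0-7]
Nodename=slurmgpu-hpc-1 Name=mps Count=100 File=/dev/nvidia0
Nodename=slurmgpu-hpc-1 Name=mps Count=100 File=/dev/nvidia1
# ... one Count=100 line per device, up to /dev/nvidia7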
Job Script Details
Here’s the job script I’m using:
#!/bin/bash
#SBATCH --job-name=cuda_mps_job
#SBATCH --output=cuda_mps_output.%j
#SBATCH --error=cuda_mps_error.%j
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --gres=mps:25
#SBATCH --time=01:00:00
#SBATCH --partition=hpc
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$SLURM_JOB_ID
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
mkdir -p $CUDA_MPS_LOG_DIRECTORY
if ! pgrep -x "nvidia-cuda-mps-control" > /dev/null; then
    echo "Starting MPS control daemon..."
    nvidia-cuda-mps-control -d
fi
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25
source /shared/home/vinil/anaconda3/etc/profile.d/conda.sh
conda activate training_env
python distributed_training.py
echo "Stopping MPS control daemon..."
echo quit | nvidia-cuda-mps-control
rm -rf $CUDA_MPS_PIPE_DIRECTORY
rm -rf $CUDA_MPS_LOG_DIRECTORY
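For debugging, I am planning to add a few lines like these near the top of the script (before my own exports) to log what Slurm actually hands each job; my understanding from the Slurm MPS docs is that Slurm itself sets CUDA_VISIBLE_DEVICES and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for a gres=mps request, so this is just a sanity print:
echo "Job $SLURM_JOB_ID on $(hostname)"
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=${CUDA_MPS_ACTIVE_THREAD_PERCENTAGE:-unset}"
nvidia-smi -L    # list the GPUs visible to this job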
Issue Details
In my setup, I have configured 800 MPS shares, aiming for 100 MPS shares per GPU. Each job is configured to use 25 MPS shares, which should allow four jobs per GPU (32 jobs total on an 8-GPU node). However, when I submit jobs, only the first GPU is utilized, while the rest are idle, causing other jobs to remain in the queue.
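While the running jobs are on the node, I can also check how Slurm itself accounts for the shares; if I understand the scontrol output correctly, the detail flag also reports per-node GRES usage, and scontrol show job reports the per-job TRES request (job 65 here is just one of the running jobs from the queue below):
scontrol -d show node slurmgpu-hpc-1 | grep -iE 'gres|tres'
scontrol show job 65 | grep -i tres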
What I’ve Tried
- Setting CUDA_VISIBLE_DEVICES following the NVIDIA MPS documentation.
- Slurm OPT_MULTIPLE_SHARING_GRES_PJ: set this flag in slurm.conf, as suggested in the Slurm docs, to allow jobs to share multiple GPUs, but it made no difference (see the sanity check after this list).
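One thing I have not yet ruled out is that the flag simply was not picked up after reconfiguring, so as a quick sanity check (assuming scontrol show config lists it alongside the other scheduler/select parameters):
scontrol show config | grep -iE 'SelectType|SchedulerParameters'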
Output from squeue shows only the first four jobs being placed (all on the first GPU), with the remaining jobs pending due to priority/resource limits.
(base) vinil@slurmgpu-scheduler:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
68 hpc cuda_mps vinil CF 0:03 1 slurmgpu-hpc-1
65 hpc cuda_mps vinil CF 0:04 1 slurmgpu-hpc-1
66 hpc cuda_mps vinil CF 0:04 1 slurmgpu-hpc-1
67 hpc cuda_mps vinil CF 0:04 1 slurmgpu-hpc-1
96 hpc cuda_mps vinil PD 0:00 1 (Priority)
95 hpc cuda_mps vinil PD 0:00 1 (Priority)
94 hpc cuda_mps vinil PD 0:00 1 (Priority)
93 hpc cuda_mps vinil PD 0:00 1 (Priority)
92 hpc cuda_mps vinil PD 0:00 1 (Priority)
91 hpc cuda_mps vinil PD 0:00 1 (Priority)
90 hpc cuda_mps vinil PD 0:00 1 (Priority)
89 hpc cuda_mps vinil PD 0:00 1 (Priority)
88 hpc cuda_mps vinil PD 0:00 1 (Priority)
87 hpc cuda_mps vinil PD 0:00 1 (Priority)
86 hpc cuda_mps vinil PD 0:00 1 (Priority)
85 hpc cuda_mps vinil PD 0:00 1 (Priority)
84 hpc cuda_mps vinil PD 0:00 1 (Priority)
83 hpc cuda_mps vinil PD 0:00 1 (Priority)
82 hpc cuda_mps vinil PD 0:00 1 (Priority)
81 hpc cuda_mps vinil PD 0:00 1 (Priority)
80 hpc cuda_mps vinil PD 0:00 1 (Priority)
79 hpc cuda_mps vinil PD 0:00 1 (Priority)
78 hpc cuda_mps vinil PD 0:00 1 (Priority)
77 hpc cuda_mps vinil PD 0:00 1 (Priority)
76 hpc cuda_mps vinil PD 0:00 1 (Priority)
75 hpc cuda_mps vinil PD 0:00 1 (Priority)
74 hpc cuda_mps vinil PD 0:00 1 (Priority)
73 hpc cuda_mps vinil PD 0:00 1 (Priority)
72 hpc cuda_mps vinil PD 0:00 1 (Priority)
71 hpc cuda_mps vinil PD 0:00 1 (Priority)
70 hpc cuda_mps vinil PD 0:00 1 (Priority)
69 hpc cuda_mps vinil PD 0:00 1 (Resources)
nvidia-smi output confirms that only the first GPU is active:
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000001:00:00.0 Off | 0 |
| N/A 38C P0 85W / 400W | 34066MiB / 40960MiB | 93% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB Off | 00000002:00:00.0 Off | 0 |
| N/A 34C P0 54W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB Off | 00000003:00:00.0 Off | 0 |
| N/A 35C P0 52W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB Off | 00000004:00:00.0 Off | 0 |
| N/A 35C P0 57W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-40GB Off | 0000000B:00:00.0 Off | 0 |
| N/A 35C P0 53W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-40GB Off | 0000000C:00:00.0 Off | 0 |
| N/A 35C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-40GB Off | 0000000D:00:00.0 Off | 0 |
| N/A 35C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-40GB Off | 0000000E:00:00.0 Off | 0 |
| N/A 35C P0 55W / 400W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 19017 M+C python 8480MiB |
| 0 N/A N/A 19018 M+C python 8480MiB |
| 0 N/A N/A 19019 M+C python 8480MiB |
| 0 N/A N/A 19020 M+C python 8480MiB |
| 0 N/A N/A 19045 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 19049 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 19050 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 19051 C nvidia-cuda-mps-server 30MiB |
+-----------------------------------------------------------------------------------------+
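In case it helps correlate the processes above with jobs, my next step is to query each job's MPS control daemon directly; if I remember the MPS control interface correctly, it supports get_server_list (and get_client_list), and the pipe directory has to match the one exported in that job's script (job 65 here is just an example from the queue above):
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-65
echo get_server_list | nvidia-cuda-mps-control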
Request
Has anyone experienced similar issues or have insights on resolving this? Any help or suggestions would be much appreciated!