I am currently trying to adapt the mps.sh script provided in the blog post on using MPS for high-throughput GROMACS simulations so that it runs OpenMM on a DGX A100 pod, with the goal of running the following hierarchy of simulations:
- a unique protein-ligand complex per GPU (so 8 protein-ligand complexes in total)
- 16 copies of each complex running concurrently on its assigned A100
I am running in a SLURM environment; unfortunately, the MPS documentation only provides guidance for PBS-based submissions (and even that is somewhat limited).
I have come across only limited mentions of using MPS with OpenMM (issue #3082 and issue #2535), but it is pretty clear from those threads that full utilization of MPS wasn't pursued.
The forum discussion for the GROMACS blog post gave some clues on how to tackle this problem. Alan Gray suggested the following:
It looks like you just need to set a unique MPS pipe directory for each job, before launching MPS.
To do this for your first job (using, e.g. /tmp/mps1 for the directory):
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps1
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
nvidia-cuda-mps-control -d

Then the second job on the same node should be able to use its GPU OK, and can also use MPS in a similar way, as long as it uses a different directory (e.g. /tmp/mps2).
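In other words (as I understand it), each additional job on the same node repeats those steps with its own directory, and any process that is meant to attach to a given daemon needs the same CUDA_MPS_PIPE_DIRECTORY value in its environment. A minimal sketch for the second job (/tmp/mps2 is just an example path):

# second job, same node, its own pipe directory
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps2
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
nvidia-cuda-mps-control -d
# CUDA programs launched from this shell inherit CUDA_MPS_PIPE_DIRECTORY
# and therefore connect to this job's daemon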
I combined this suggestion with the mps.sh script provided in the blog post and the SLURM submission script I have been using for these jobs, producing the following:
#!/bin/bash
# Demonstrator script to run multiple simulations per GPU with MPS on DGX-A100
#
# Alan Gray, NVIDIA
### SLURM SCHEDULER SETTINGS
## generic job settings for running in SLURM (w/GPUs)
## #SBATCH --job-name=sim_1 # job name (default is the name of this file)
## #SBATCH --output=log.%x.job_%j # file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --time=0:10:00 # maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --exclusive # request exclusive allocation of resources
#SBATCH --partition=026-partition # put the job into the gpu partition
## #SBATCH --mem=20G # RAM per node
#SBATCH --threads-per-core=1 # do not use hyperthreads (i.e. CPUs = physical cores below)
#SBATCH --cpus-per-task=2 # number of CPUs per process
## nodes allocation
#SBATCH --nodes=1 # number of nodes
## #SBATCH --ntasks-per-node=2 # MPI processes per node
## GPU allocation
#SBATCH --gpus-per-task=1 # number of GPUs per process
#SBATCH --gpu-bind=single:1 # bind each process to its own GPU (single:<tasks_per_gpu>)
source ~/.bashrc
# activate the conda environment
source activate modbind
# change to directory in which job is submitted
cd $SLURM_SUBMIT_DIR
# activate job management functions for stopping ModBind runs
source ./manage_resource.sh
source ./var_md.sh
# Location of ModBind environment
MBPY=/lustre/miniconda3/envs/modbind/bin/python
# Location of input files
CONFIG=/lustre/alivexis-data/modbind-testing/AURKA/simulateComplexWithSolvent.py
NGPU=${NUM_GPUS} # Number of GPUs in server
NCORE=128 # Number of CPU cores in server
NSIMPERGPU=${NUM_REPLICAS} # Number of simulations to run per GPU (with MPS)
# Number of threads per simulation
NTHREAD=$(($NCORE/($NGPU*$NSIMPERGPU)))
if [ $NTHREAD -eq 0 ]
then
NTHREAD=1
fi
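# e.g. with NCORE=128, NGPU=8 and NSIMPERGPU=2 (the values I am testing with), NTHREAD = 128/(8*2) = 8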
export OMP_NUM_THREADS=$NTHREAD
# Start MPS daemon
nvidia-cuda-mps-control -d
# Loop over number of GPUs in server
for (( i=0; i<$NGPU; i++ ));
do
    # Set a CPU NUMA region specific to GPU in use with best affinity (specific to DGX-A100)
    case $i in
        0)NUMA=3;;
        1)NUMA=2;;
        2)NUMA=1;;
        3)NUMA=0;;
        4)NUMA=7;;
        5)NUMA=6;;
        6)NUMA=5;;
        7)NUMA=4;;
    esac
    # set variables to point to pdb and sdf files -- assign 1:1 ligand:GPU
    for f in `cat ${LISTDIR}/${LISTPREFIX}_1`
    do
        bname=$(basename $f | sed -e 's/\.sdf$//')
        echo ${bname}
        output="${OUTPUTDIR}/${bname}"
        # pull PDB name specific to protein-ligand complex
        pdbname=$( echo $bname | sed -e 's/_/ /g' | awk '{print $NF}')
        pdbfile="${PDBDIR}/${pdbname}.pdb"
        # pull sdf filename for specific ligand
        sdffile=${SDFDIR}/${bname}.sdf
        # change to the parent (output) directory for this complex
        cd ${output}
        while true
        do
            # Loop over number of simulations per GPU
            for (( j=0; j<$NSIMPERGPU; j++ ));
            do
                # Create a unique identifier for this simulation to use as a working directory
                id=gpu${i}_sim${j}
                rm -rf $id
                mkdir -p $id
                cd $id
                # copy the ligand db.json GAFF file into the child directory
                #ln -s ../db.json .
                cp ../db.json .
                # Launch OpenMM in the background on the desired resources
                echo "Launching simulation $j on GPU $i with $NTHREAD CPU thread(s) on NUMA region $NUMA"
                CUDA_VISIBLE_DEVICES=$i numactl --cpunodebind=$NUMA $MBPY $CONFIG \
                    $pdbfile $sdffile mb-prod-$j 1000000 $j $i \
                    > mps-test-$j.log 2>&1 &
                cd ..
            done
            cd ..
        done
        wait
    done
done
echo "Waiting for simulations to complete..."
wait
I am using another lookup file to define the variables referenced in this script (e.g., NUM_GPUS = 8 and NUM_REPLICAS = 2).
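For reference, the relevant variable definitions look roughly like this (the directory values below are placeholders rather than my real paths; only NUM_GPUS and NUM_REPLICAS are the actual values in use):

NUM_GPUS=8                        # GPUs in the DGX A100
NUM_REPLICAS=2                    # simulations per GPU for this test
LISTDIR=/path/to/ligand_lists     # placeholder
LISTPREFIX=ligand_list            # placeholder
OUTPUTDIR=/path/to/output         # placeholder
PDBDIR=/path/to/pdb               # placeholder
SDFDIR=/path/to/sdf               # placeholder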
Depending on where the second-to-last "wait" call is placed, the MPS daemon will either assign the two simulations (NUM_REPLICAS) to the first GPU and wait for those to complete before moving on, or will assign all 16 simulations (8 GPUs * 2 replicas per GPU) to GPU 0.
Alan's suggestion to create a distinct MPS_PIPE_DIRECTORY for each GPU makes sense to me, but I am stuck on how to point each MPS daemon, and the set of MPS client processes belonging to it, at the correct PIPE_DIRECTORY. My initial thought was to place the creation of the sub-directories (/tmp/mps1, /tmp/mps2, etc.) inside my first for loop, the one tied to the number of GPUs on the DGX, but this just hung.
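Concretely, the kind of per-GPU setup I have been attempting looks roughly like this (a simplified sketch rather than my exact code; the CUDA_MPS_LOG_DIRECTORY lines and the CUDA_VISIBLE_DEVICES restriction on each daemon are my own additions on top of Alan's suggestion):

# replace the single "nvidia-cuda-mps-control -d" call with one daemon per GPU
for (( i=0; i<$NGPU; i++ ));
do
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps$i
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log$i
    mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY
    # restrict this daemon to a single GPU; clients launched with the same
    # CUDA_MPS_PIPE_DIRECTORY value should then attach to this daemon
    CUDA_VISIBLE_DEVICES=$i nvidia-cuda-mps-control -d
done

My assumption is that each simulation would then also need the matching CUDA_MPS_PIPE_DIRECTORY exported on its launch line (alongside CUDA_VISIBLE_DEVICES), but that is the part I cannot get to behave as expected.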
The forum thread for that blog post is pretty stale (>2 years old); I already posted there but haven't gotten a response in almost a month, so I thought it was prudent to post here as well.