I am currently trying to adapt the mps.sh script provided in the blog post on using MPS for high-throughput GROMACS simulations so that it runs OpenMM on a DGX A100 pod, with the goal of running the following hierarchy of simulations:
- a unique protein-ligand complex per GPU (so 8 protein-ligand complexes in total)
- 16 copies of each complex running concurrently on its assigned A100
I am running in a SLURM environment; unfortunately, the MPS documentation only provides guidance for PBS-based submissions (and even that is somewhat limited).
I have come across only limited mentions of using MPS with OpenMM (issue #3082 and issue #2535), but it is pretty clear from those threads that full utilization of MPS wasn't pursued.
The forum discussion for the GROMACS blog post gave some clues on how to tackle this problem. Alan Gray suggested the following:
It looks like you just need to set a unique MPS pipe directory for each job, before launching MPS.
To do this for your first job (using, e.g. /tmp/mps1 for the directory):
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps1
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
nvidia-cuda-mps-control -d

Then the second job on the same node should be able to use its GPU OK, and can also use MPS in a similar way, as long as it uses a different directory (e.g. /tmp/mps2).
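In other words (as I understand it), each additional job on the same node repeats those steps with its own directory, and any process that is meant to attach to a given daemon needs the same CUDA_MPS_PIPE_DIRECTORY value in its environment. A minimal sketch for the second job (/tmp/mps2 is just an example path):

# second job, same node, its own pipe directory
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps2
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
nvidia-cuda-mps-control -d
# CUDA programs launched from this shell inherit CUDA_MPS_PIPE_DIRECTORY
# and therefore connect to this job's daemon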
I combined this suggestion with the mps.sh script provided in the blog post and the SLURM submission script I have been using for these jobs, producing the following:
#!/bin/bash
# Demonstrator script to run multiple simulations per GPU with MPS on DGX-A100
#
# Alan Gray, NVIDIA
### SLURM SCHEDULER SETTINGS
## generic job settings for running in SLURM (w/GPUs)
## #SBATCH --job-name=sim_1 # job name (default is the name of this file)
## #SBATCH --output=log.%x.job_%j # file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --time=0:10:00 # maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --exclusive # request exclusive allocation of resources
#SBATCH --partition=026-partition # put the job into the gpu partition
## #SBATCH --mem=20G # RAM per node
#SBATCH --threads-per-core=1 # do not use hyperthreads (i.e. CPUs = physical cores below)
#SBATCH --cpus-per-task=2 # number of CPUs per process
## nodes allocation
#SBATCH --nodes=1 # number of nodes
## #SBATCH --ntasks-per-node=2 # MPI processes per node
## GPU allocation
#SBATCH --gpus-per-task=1 # number of GPUs per process
#SBATCH --gpu-bind=single:1 # bind each process to its own GPU (single:<tasks_per_gpu>)
source ~/.bashrc
# activate the conda environment
source activate modbind
# change to directory in which job is submitted
cd $SLURM_SUBMIT_DIR
# activate job management functions for stopping ModBind runs
source ./manage_resource.sh
source ./var_md.sh
# Location of ModBind environment
MBPY=/lustre/miniconda3/envs/modbind/bin/python
# Location of input files
CONFIG=/lustre/alivexis-data/modbind-testing/AURKA/simulateComplexWithSolvent.py
NGPU=${NUM_GPUS} # Number of GPUs in server
NCORE=128 # Number of CPU cores in server
NSIMPERGPU=${NUM_REPLICAS} # Number of simulations to run per GPU (with MPS)
# Number of threads per simulation
NTHREAD=$(($NCORE/($NGPU*$NSIMPERGPU)))
if [ $NTHREAD -eq 0 ]
then
NTHREAD=1
fi
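# e.g. with NCORE=128, NGPU=8 and NSIMPERGPU=2 (the values I am testing with), NTHREAD = 128/(8*2) = 8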
export OMP_NUM_THREADS=$NTHREAD
# Start MPS daemon
nvidia-cuda-mps-control -d
# Loop over number of GPUs in server
for (( i=0; i<$NGPU; i++ ));
do
    # Set a CPU NUMA region specific to GPU in use with best affinity (specific to DGX-A100)
    case $i in
        0)NUMA=3;;
        1)NUMA=2;;
        2)NUMA=1;;
        3)NUMA=0;;
        4)NUMA=7;;
        5)NUMA=6;;
        6)NUMA=5;;
        7)NUMA=4;;
    esac
    # set variables to point to pdb and sdf files -- assign 1:1 ligand:GPU
    for f in `cat ${LISTDIR}/${LISTPREFIX}_1`
    do
        bname=$(basename $f | sed -e 's/\.sdf$//')
        echo ${bname}
        output="${OUTPUTDIR}/${bname}"
        # pull PDB name specific to protein-ligand complex
        pdbname=$( echo $bname | sed -e 's/_/ /g' | awk '{print $NF}')
        pdbfile="${PDBDIR}/${pdbname}.pdb"
        # pull sdf filename for specific ligand
        sdffile=${SDFDIR}/${bname}.sdf
        # change to the parent (output) directory for this complex
        cd ${output}
        while true
        do
            # Loop over number of simulations per GPU
            for (( j=0; j<$NSIMPERGPU; j++ ));
            do
                # Create a unique identifier for this simulation to use as a working directory
                id=gpu${i}_sim${j}
                rm -rf $id
                mkdir -p $id
                cd $id
                # copy the ligand db.json GAFF file into the child directory
                #ln -s ../db.json .
                cp ../db.json .
                # Launch OpenMM in the background on the desired resources
                echo "Launching simulation $j on GPU $i with $NTHREAD CPU thread(s) on NUMA region $NUMA"
                CUDA_VISIBLE_DEVICES=$i numactl --cpunodebind=$NUMA $MBPY $CONFIG \
                    $pdbfile $sdffile mb-prod-$j 1000000 $j $i \
                    > mps-test-$j.log 2>&1 &
                cd ..
            done
            cd ..
        done
        wait
    done
done
echo "Waiting for simulations to complete..."
wait
I am using another lookup file to define the variables referenced in this script (e.g., NUM_GPUS = 8 and NUM_REPLICAS = 2).
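For reference, the relevant variable definitions look roughly like this (the directory values below are placeholders rather than my real paths; only NUM_GPUS and NUM_REPLICAS are the actual values in use):

NUM_GPUS=8                        # GPUs in the DGX A100
NUM_REPLICAS=2                    # simulations per GPU for this test
LISTDIR=/path/to/ligand_lists     # placeholder
LISTPREFIX=ligand_list            # placeholder
OUTPUTDIR=/path/to/output         # placeholder
PDBDIR=/path/to/pdb               # placeholder
SDFDIR=/path/to/sdf               # placeholder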
Depending on where the second-to-last "wait" call is placed, the MPS daemon will either assign the two simulations (NUM_REPLICAS) to the first GPU and wait for those to complete before moving on, or will assign all 16 simulations (8 GPUs * 2 replicas per GPU) to GPU 0.
Alan's suggestion to create a distinct MPS_PIPE_DIRECTORY for each GPU makes sense to me, but I am stuck on how to point each MPS daemon, and the set of MPS client processes belonging to it, at the correct PIPE_DIRECTORY. My initial thought was to place the creation of the sub-directories (/tmp/mps1, /tmp/mps2, etc.) inside my first for loop, the one tied to the number of GPUs on the DGX, but this just hung.
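Concretely, the kind of per-GPU setup I have been attempting looks roughly like this (a simplified sketch rather than my exact code; the CUDA_MPS_LOG_DIRECTORY lines and the CUDA_VISIBLE_DEVICES restriction on each daemon are my own additions on top of Alan's suggestion):

# replace the single "nvidia-cuda-mps-control -d" call with one daemon per GPU
for (( i=0; i<$NGPU; i++ ));
do
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps$i
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log$i
    mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY
    # restrict this daemon to a single GPU; clients launched with the same
    # CUDA_MPS_PIPE_DIRECTORY value should then attach to this daemon
    CUDA_VISIBLE_DEVICES=$i nvidia-cuda-mps-control -d
done

My assumption is that each simulation would then also need the matching CUDA_MPS_PIPE_DIRECTORY exported on its launch line (alongside CUDA_VISIBLE_DEVICES), but that is the part I cannot get to behave as expected.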
The forum thread for that blog post is pretty stale (>2 years old); I already posted there but haven't gotten a response in almost a month, so I thought it was prudent to post here as well.