Maximizing GROMACS Throughput with Multiple Simulations per GPU Using MPS and MIG

Originally published at: https://developer.nvidia.com/blog/maximizing-gromacs-throughput-with-multiple-simulations-per-gpu-using-mps-and-mig/

In this post, we demonstrate the benefits of running multiple simulations per GPU for GROMACS and show how MPS can achieve up to 1.8X overall improvement in throughput.


Hello Alan and Szilárd,

Thanks for the very useful post. I have tried MPS on V100s and have seen a massive improvement in performance.

I am facing an issue with using MPS on nodes with multiple GPUs (two GPUs), and would appreciate your help:

I am using a job scheduler (qsub) to submit a GROMACS simulation run. Each job requests 1xV100 + 4xCPUs. I run one independent simulation on each CPU core, so there are 4 runs going in parallel. The sample command is as follows:

for i in {1..4}
do
    cd $i
    nvidia-cuda-mps-control -d 
    bash simulation.sh &
    cd ..
done

The above command runs perfectly, and I can see the nvidia-cuda-mps-server process and four gmx_mpi processes on GPU 0.

However, the issue arises when the job scheduler assigns another job to the same node. Since the nodes have 2 GPUs, it assigns the job to the other GPU (GPU 1), with the same resources requested and the same script run as above.

The messages/errors on using gmx mdrun are as follows:

-------------------------------------------------------
Program:     gmx mdrun, version 2021.4-plumed-2.7.3
Source file: src/gromacs/taskassignment/findallgputasks.cpp (line 86)

Fatal error:
Cannot run short-ranged nonbonded interactions on a GPU because no GPU is
detected.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
------------------------------------------------------

I also notice this in the log file:

CUDA compiler:      /home/soft/cuda-11.0.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Thu_Jun_11_22:26:38_PDT_2020;Cuda compilation tools, release 11.0, V11.0.194;Build cuda_11.0_bu.TC445_37.28540450_0
CUDA compiler flags:-std=c++17;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_70,code=sm_70;-use_fast_math;;-mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver:        11.40
CUDA runtime:       N/A

The CUDA runtime shows N/A, while for the first job it shows 11.0.

Please note the following:

  1. qsub does not have any flags for managing MPS.
  2. The error does not occur when MPS is not activated.
  3. The error does not occur if the flags -nb gpu -bonded gpu -pme gpu -update gpu are skipped, even when MPS is already activated.

I would really appreciate it if you could help resolve this. Let me know if you need additional details.

Thanks,

Akshay,
PhD student

I am new to parallelization and do not understand why one would run the same job multiple times on the same GPU. I want to run multiple different jobs on the same GPU, and I cannot follow the $INPUT file manipulation in the script.
I hope you can help me.

Hi Akshay, can you please provide your simulation.sh script?

Hi Akshay,

In your simulation.sh script, are you setting the CUDA_VISIBLE_DEVICES environment variable? If so, please try removing it. When you launch each job with 1xV100, I expect that each device will be available as (the default) GPU 0 within each job (you can check this with nvidia-smi), such that setting CUDA_VISIBLE_DEVICES to any other value would cause the error you see. If you still get the error, then I’m not sure of the cause at the moment, but I can try to reproduce it internally.

One other thing to try is launching jobs with multiple GPUs in each job, and using CUDA_VISIBLE_DEVICES in a similar way to that shown in the script in the blog.
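As a rough sketch of that idea (the two-GPU allocation and directory names here are illustrative; simulation.sh is the script referred to above):

```shell
#!/bin/bash
# Illustrative sketch only: one job allocated both GPUs, with each
# background simulation pinned to a device via CUDA_VISIBLE_DEVICES,
# similar to the script in the blog. Directory names are made up.
for gpu in 0 1
do
    mkdir -p sim_gpu${gpu}
    cd sim_gpu${gpu}
    export CUDA_VISIBLE_DEVICES=$gpu
    # run the simulation only if the script is actually present
    [ -f ../simulation.sh ] && bash ../simulation.sh &
    cd ..
done
wait
```

Each backgrounded run inherits the CUDA_VISIBLE_DEVICES value in effect when it was launched, so the two simulations end up on different devices.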

Best regards,

Alan

Hi Ravis,

The relevant lines in the first script given in the blog are L45-51, where I create a new directory specific to each simulation, and copy the (same) input file into that directory. The directory naming structure I use is gpu${i}_sim${j}, such that e.g. for 2 simulations on each of 2 GPUs we would have 4 directories, gpu0_sim0, gpu0_sim1, gpu1_sim0 and gpu1_sim1.

In your case, of course, you will want to use a different input file in each directory. I suggest setting up these directories in advance, each with the appropriate input file(s), and then, for each simulation, simply “cd” into the appropriate pre-existing directory to run the simulation (i.e. remove lines 47, 48 and 51 but keep lines 46 and 49).
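As a concrete sketch of that setup step (the input file names are purely illustrative):

```shell
#!/bin/bash
# Sketch of the setup suggested above: pre-create one directory per
# simulation, each holding its own input file(s). The run loop then
# only needs to cd into the existing directory.
NGPU=2
NSIMPERGPU=2
for (( i=0; i<NGPU; i++ ))
do
    for (( j=0; j<NSIMPERGPU; j++ ))
    do
        id=gpu${i}_sim${j}
        mkdir -p $id
        # copy this simulation's own input into its directory, e.g.:
        # cp inputs/topol_${i}_${j}.tpr ${id}/topol.tpr
    done
done
```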

Best regards,

Alan

Hello Dr Alan,

I appreciate your response to my queries.

1)
No, I am not setting the CUDA_VISIBLE_DEVICES environment variable.
(Though I had also tried running mdrun after setting this as detailed in your blog.)

The simulation.sh file solely consists of:

module load apps/gromacs/2021.4/gnu
export OMP_NUM_THREADS=1

mpirun -np 1 gmx_mpi mdrun -v -s md.tpr -o md.trr -x md.xtc -cpo md.cpt -e md.edr -g md.log -c md.gro -ntomp 1 -nstlist 150 -nb gpu -bonded gpu -pme gpu -update gpu

2)
I have tried launching jobs with multiple GPUs and using the CUDA_VISIBLE_DEVICES variable. This worked as expected, without errors: the simulations ran on GPU 0 or 1 according to the CUDA_VISIBLE_DEVICES value used with gmx mdrun.

Some observations:

  1. No user is able to use the second GPU with -nb gpu -bonded gpu -pme gpu -update gpu when MPS has been activated by someone on the first GPU.
  2. GROMACS only uses CPUs when the -nb gpu -bonded gpu -pme gpu -update gpu flags are skipped for jobs on the second GPU while MPS is already running on the first GPU; therefore we don’t see the “no GPU is detected” error.

I am attaching the tpr file, in case you would like to test it at your end.
md.tpr (6.1 MB)

Thank you,
Akshay.

Hi Akshay,

Thanks for the info. It looks like you just need to set a unique MPS pipe directory for each job, before launching MPS.

To do this for your first job (using, e.g. /tmp/mps1 for the directory):

export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps1
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
nvidia-cuda-mps-control -d

Then the second job on the same node should be able to use its GPU OK, and can also use MPS in a similar way, as long as it uses a different directory (e.g. /tmp/mps2).
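Put together, a sketch for the second job might look like the following (the directory paths are just examples; setting CUDA_MPS_LOG_DIRECTORY is optional, but it keeps the daemon from trying to write to /var/log/nvidia-mps):

```shell
#!/bin/bash
# Second job on the same node: give it its own MPS pipe (and log)
# directory so it starts an independent daemon instead of colliding
# with the first job's. Paths are examples only.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps2
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps2-log
mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY
# guard so the sketch is harmless on a node without the MPS binary
if command -v nvidia-cuda-mps-control >/dev/null
then
    nvidia-cuda-mps-control -d
fi
```

Note that the gmx_mpi clients in that job must see the same CUDA_MPS_PIPE_DIRECTORY value, so export it before launching the simulations.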

Best regards,

Alan

Hello Dr Alan,

Thank you very much for the suggestion. I will update you in case the problem remains unresolved.

Regards,

Akshay

I have almost reproduced the results given in this post for launching multiple simulations on the A100 GPU using MPS and MIG. Now I am trying to use the NVIDIA Nsight Systems profiler to get a deeper understanding of the system state: GPUs, memory copies, kernel execution times, etc. However, I cannot find any resource or article that explains how to create such profiles. I have browsed the Nsight documentation and tried many different ways of creating profiles, but NVTX events (along with some other events) are not captured. Is there a blog/article which explains how this can be done? Thanks in advance for the help!
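For reference, this is the kind of invocation I have been trying (flags as described in the Nsight Systems documentation; as I understand it, NVTX ranges only appear in the timeline if the application itself emits NVTX annotations):

```shell
# Trace CUDA, NVTX and OS runtime activity for a single GROMACS run.
# Output file name is arbitrary; mdrun flags match the runs above.
nsys profile --trace=cuda,nvtx,osrt --output=gmx_report \
    gmx mdrun -s md.tpr -nb gpu -pme gpu -bonded gpu -update gpu -ntomp 1
```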

Hi Dr. Alan,

I am able to run the script and get an output of 32 .xtc files. I want to concatenate these separate files into one xtc file that covers the entire simulation. I’ve tried this using the GROMACS trjcat utility as well as concatenating in VMD, but the xtc files do not appear to be in chronological order, where gpu0_sim0 is the first portion of the trajectory, gpu0_sim1 is the next, and so on. The position of the protein makes a large jump when moving from the last frame of one xtc file to the first frame of the next. Is there a way to stitch all the separate xtc files together into one contiguous trajectory?

Thank you,
Reuben

@alang Thanks for providing these scripts and the information on using MPS with GROMACS. I am currently trying to adapt the mps.sh script to run OpenMM on a DGX A100 pod, with the goal of running the following hierarchy of simulations:

  • unique protein-ligand complex per GPU (so total of 8 protein-ligand complexes)
  • 16 copies of a respective protein-ligand complex on each A100

I have come across limited mentions of using MPS with OpenMM (issue #3082 and issue #2535), but it is pretty clear from those posts that full utilization of MPS wasn’t pursued.

As per Alan’s suggestion to Akshay above,

To do this for your first job (using, e.g. /tmp/mps1 for the directory):

export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps1
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
nvidia-cuda-mps-control -d

Then the second job on the same node should be able to use its GPU OK, and can also use MPS in a similar way, as long as it uses a different directory (e.g. /tmp/mps2).

I attempted to implement this suggestion by incorporating creation of 8 separate CUDA_MPS_PIPE_DIRECTORY folders for each of the 8 A100 GPUs into the initial for loop in the mps.sh script. Below are the relevant code blocks:

NGPU=8 # Number of GPUs in server
NCORE=128 # Number of CPU cores in server

NSIMPERGPU=16 # Number of simulations to run per GPU (with MPS)

# Number of threads per simulation
NTHREAD=$(($NCORE/($NGPU*$NSIMPERGPU)))
if [ $NTHREAD -eq 0 ]
then
    NTHREAD=1
fi
export OMP_NUM_THREADS=$NTHREAD

# Start MPS daemon
#nvidia-cuda-mps-control -d

# Loop over number of GPUs in server
for (( i=0; i<$NGPU; i++ ));
do
    # set temporary directory for mps bookkeeping
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps$i
    mkdir -p $CUDA_MPS_PIPE_DIRECTORY
    nvidia-cuda-mps-control -d
    # Set a CPU NUMA specific to GPU in use with best affinity (specific to DGX-A100)
    case $i in
        0)NUMA=3;;
        1)NUMA=2;;
        2)NUMA=1;;
        3)NUMA=0;;
        4)NUMA=7;;
        5)NUMA=6;;
        6)NUMA=5;;
        7)NUMA=4;;
    esac

    # set variables to point to pdb and sdf files -- assign 1:1 ligand:GPU
    for f in `cat ${LISTDIR}/${LISTPREFIX}_$i`
    do
        bname=$(basename $f | sed -e 's/.sdf//g')
        echo ${bname}
        output="${OUTPUTDIR}/${bname}"
        # pull PDB name specific to protein-ligand complex
        pdbname=$( echo $bname | sed -e 's/_/ /g' | awk '{print $NF}')
        pdbfile="${PDBDIR}/${pdbname}.pdb"
        # pull sdf filename for specific ligand
        sdffile=${SDFDIR}/${bname}.sdf
        # change to parent directory
        cd ${output}

        while true
        do

            # Loop over number of simulations per GPU
            for (( j=0; j<$NSIMPERGPU; j++ ));
            do
                # Create a unique identifier for this simulation to use as a working directory
                id=gpu${i}_sim${j}
                rm -rf $id
                mkdir -p $id
                cd $id
                # create symlink to ligand db.json GAFF file in child directory
                ln -s ../db.json .

                # Launch openMM in the background on the desired resources
                echo "Launching simulation $j on GPU $i with $NTHREAD CPU thread(s) on NUMA region $NUMA"
                CUDA_VISIBLE_DEVICES=$i numactl --cpunodebind=$NUMA $MBPY $CONFIG \
                                    $pdbfile $sdffile mb-prod-$j 2000000 $j $i \
                                    > mps-test-$j.log 2>&1 &
                cd ..
            done
        done
        cd ..
    done
done
echo "Waiting for simulations to complete..."
wait

However, when I run the script, it looks like all 128 of my simulations are being assigned to GPU 1:

./mps-test-single-ligand.sh 
Warning: Failed writing log files to directory [/var/log/nvidia-mps]. No logs will be available.
An instance of this daemon is already running
cat: /lustre/alivexis-data/modbind-testing/AURKA/lists/assign_list_0: No such file or directory
Warning: Failed writing log files to directory [/var/log/nvidia-mps]. No logs will be available.
An instance of this daemon is already running
Compound_1000-1_2XRU
Launching simulation 0 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 1 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 2 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 3 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 4 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 5 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 6 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 7 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 8 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 9 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 10 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 11 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 12 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 13 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 14 on GPU 1 with 1 CPU thread(s) on NUMA region 2
Launching simulation 15 on GPU 1 with 1 CPU thread(s) on NUMA region 2
[the 16 launch messages above repeat twice more, all on GPU 1 / NUMA region 2]

I know this is an old thread, but it is directly relevant to what I am trying to do, and there are no threads on the developer forum that specifically address OpenMM and MPS. All of the above errors were produced while I was logged directly onto the GPU node. Eventually I would like to adapt this to a Slurm submission script that can be used from the login node, but I need to walk before I can run. Thanks for your help.

Blake