Maximizing GROMACS Throughput with Multiple Simulations per GPU Using MPS and MIG

Originally published at:

In this post, we demonstrate the benefits of running multiple simulations per GPU for GROMACS and show how MPS can achieve up to 1.8X overall improvement in throughput.


Hello Alan and Szilárd,

Thanks for the very useful post. I have tried using MPS on V100s and have seen a massive improvement in performance.

I am facing an issue when using MPS on nodes with multiple GPUs (two GPUs per node), and I would appreciate your help with it:

I am using a job scheduler (qsub) to submit GROMACS simulation runs. Each job requests 1xV100+4xCPUs. I run one independent simulation on each CPU core, so I have 4 runs going in parallel. The sample command is as follows:

for i in {1..4}
do
    cd $i
    nvidia-cuda-mps-control -d 
    bash &
    cd ..
done

The above command runs perfectly, and I can see nvidia-cuda-mps-server and four gmx_mpi processes on GPU 0.

However, the issue arises when the job scheduler assigns another job to the same node. Since each node has 2 GPUs, the scheduler assigns the second job to the other GPU (GPU 1). (The second job requests the same resources and runs the same script as above.)

The messages/errors on using gmx mdrun are as follows:

Program:     gmx mdrun, version 2021.4-plumed-2.7.3
Source file: src/gromacs/taskassignment/findallgputasks.cpp (line 86)

Fatal error:
Cannot run short-ranged nonbonded interactions on a GPU because no GPU is
detected.

For more information and tips for troubleshooting, please check the GROMACS
website at

I also notice this in the log file:

CUDA compiler:      /home/soft/cuda-11.0.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2020 NVIDIA Corporation;Built on Thu_Jun_11_22:26:38_PDT_2020;Cuda compilation tools, release 11.0, V11.0.194;Build cuda_11.0_bu.TC445_37.28540450_0
CUDA compiler flags:-std=c++17;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_70,code=sm_70;-use_fast_math;;-mavx2 -mfma -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver:        11.40
CUDA runtime:       N/A

The CUDA runtime shows N/A, whereas for the first job it shows 11.0.

Please note the following:

  1. qsub does not have any flags for managing mps.
  2. The error does not occur when mps is not activated.
  3. The error does not occur if the flags -nb gpu -bonded gpu -pme gpu -update gpu are skipped even if mps was already activated.

I would really appreciate it if you can help resolve my query. Let me know if you need additional details.

PhD student,



I am new to parallelization and do not understand how the same job is run multiple times on the same GPU. I want to run multiple jobs on the same GPU, and I cannot follow the $INPUT file manipulation.
I hope you can help me.

Hi Akshay, can you please provide me with your script?

Hi Akshay,

In your script, are you setting the CUDA_VISIBLE_DEVICES environment variable? If so, can you please try removing that? When you launch each job with 1xV100, I expect that each device will be available as (the default) GPU 0 in each of the jobs (you can check this with nvidia-smi), such that setting CUDA_VISIBLE_DEVICES to any other value would cause the error you see. If you still get the error, then I’m not sure of the cause at the moment but can try to reproduce it internally.

One other thing to try is launching jobs with multiple GPUs in each job, and using CUDA_VISIBLE_DEVICES in a similar way to that shown in the script in the blog.

Best regards,


Hi Ravis,

The relevant lines in the first script given in the blog are L45-51, where I create a new directory specific to each simulation, and copy the (same) input file into that directory. The directory naming structure I use is gpu${i}_sim${j}, such that e.g. for 2 simulations on each of 2 GPUs we would have 4 directories, gpu0_sim0, gpu0_sim1, gpu1_sim0 and gpu1_sim1.

In your case, of course, you will want to use a different input file in each directory. I suggest setting up these directories in advance, each with the appropriate input file(s), and then, for each simulation, simply “cd” into the appropriate pre-existing directory to run the simulation (i.e. remove lines 47, 48 and 51 but keep lines 46 and 49).
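As a rough illustration of the pre-existing-directory approach (this is not the script from the blog, and it leaves out details such as thread pinning), the loop could be sketched as below. The function name run_in_sim_dirs and its argument order are made up for this example; the directory names follow the gpu${i}_sim${j} pattern described above, and the command to run is passed in by the caller.

```shell
#!/bin/bash
# Sketch: run one simulation in each pre-existing gpu<i>_sim<j> directory.
# run_in_sim_dirs is a hypothetical helper, not part of GROMACS or the blog script.

run_in_sim_dirs() {
    local ngpu="$1" nsim="$2"
    shift 2                      # remaining arguments: the command to run
    local i j dir
    for (( i = 0; i < ngpu; i++ )); do
        for (( j = 0; j < nsim; j++ )); do
            dir="gpu${i}_sim${j}"
            if [ ! -d "$dir" ]; then
                echo "skipping missing directory: $dir" >&2
                continue
            fi
            # Subshell: the cd and CUDA_VISIBLE_DEVICES affect only this run.
            ( cd "$dir" && CUDA_VISIBLE_DEVICES="$i" "$@" > run.out 2>&1 ) &
        done
    done
    wait                         # block until every background run finishes
}
```

For 2 simulations on each of 2 GPUs, a call might look like `run_in_sim_dirs 2 2 gmx mdrun -s md.tpr -nb gpu -bonded gpu -pme gpu -update gpu`, assuming each directory already contains its own md.tpr.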

Best regards,


Hello Dr Alan,

I appreciate your response to my queries.

No, I am not setting the CUDA_VISIBLE_DEVICES environment variable.
(Though I had also tried running mdrun after setting this as detailed in your blog.)

The file solely consists of:

module load apps/gromacs/2021.4/gnu

mpirun -np 1 gmx_mpi mdrun -v -s md.tpr -o md.trr -x md.xtc -cpo md.cpt -e md.edr -g md.log -c md.gro -ntomp 1 -nstlist 150 -nb gpu -bonded gpu -pme gpu -update gpu

I have also tried launching jobs with multiple GPUs and using the CUDA_VISIBLE_DEVICES variable. This worked as expected, without errors: the simulations ran on GPU ID 0 or 1 according to the CUDA_VISIBLE_DEVICES value set for gmx mdrun.

Some observations:

  1. No user is able to use the second GPU with -nb gpu -bonded gpu -pme gpu -update gpu while MPS has been activated by someone on the first GPU.
  2. When the -nb gpu -bonded gpu -pme gpu -update gpu flags are skipped for jobs on the second GPU while MPS is already running on the first GPU, GROMACS uses only the CPUs. That is why we don’t see the “no GPU is detected” error in that case.

I am attaching the tpr file, in case you would like to test them at your end.
md.tpr (6.1 MB)

Thank you,

Hi Akshay,

Thanks for the info. It looks like you just need to set a unique MPS pipe directory for each job, before launching MPS.

To do this for your first job (using, e.g. /tmp/mps1 for the directory):

export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps1
nvidia-cuda-mps-control -d

Then the second job on the same node should be able to use its GPU OK, and can also use MPS in a similar way, as long as it uses a different directory (e.g. /tmp/mps2).
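To avoid hard-coding /tmp/mps1 and /tmp/mps2, the directory could be derived from the scheduler's job ID. The sketch below assumes the scheduler exports PBS_JOBID (common with PBS-style qsub, but not confirmed for this cluster) and falls back to the shell's PID otherwise. The helper names and the /tmp/mps_ prefix are illustrative only. It also sets CUDA_MPS_LOG_DIRECTORY so that the two daemons do not share log files.

```shell
#!/bin/bash
# Sketch: give each scheduler job its own MPS pipe (and log) directory.
# mps_pipe_dir, start_job_mps and stop_job_mps are hypothetical helper names.

mps_pipe_dir() {
    # Use the given job ID, or this shell's PID if none is supplied.
    local job_id="${1:-$$}"
    # Replace characters outside [A-Za-z0-9._-] so the ID is a safe path component.
    printf '/tmp/mps_%s\n' "${job_id//[^A-Za-z0-9._-]/_}"
}

start_job_mps() {
    # PBS_JOBID is an assumption about the scheduler environment.
    export CUDA_MPS_PIPE_DIRECTORY="$(mps_pipe_dir "${PBS_JOBID:-}")"
    export CUDA_MPS_LOG_DIRECTORY="${CUDA_MPS_PIPE_DIRECTORY}/log"
    mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
    nvidia-cuda-mps-control -d   # start this job's own MPS daemon, once per job
}

stop_job_mps() {
    # Shut down the daemon that listens on this job's pipe directory.
    echo quit | nvidia-cuda-mps-control
}
```

A job script would call start_job_mps once before launching its simulations and stop_job_mps after they finish. Because the variables are exported, the simulations started afterwards from the same script pick up the same pipe directory.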

Best regards,


Hello Dr Alan,

Thank you very much for the suggestion. I will update you in case the problem remains unresolved.