Molecular dynamics simulations on GROMACS with CUDA run slow midway through

evanc1 · March 27, 2024, 4:06am

Context:
I use GROMACS 2023.3 with CUDA 12.3 on my computer with a single RTX 4090.
I simulate lipid bilayer assembly processes for a few hundred nanoseconds; the tasks are relatively computationally demanding.

Problem:
When I run the simulation (with the “gmx mdrun” command; see example below), the initial estimated time to completion is about 1-2 days, which is reasonable. My computer completes about 1000 timesteps every 1-2 seconds. However, after about 8-16 hours of running, the run slows down significantly, with only 100 steps processed every 10 seconds or so, and the estimated time to completion increases to 7+ days. In short, my runs become roughly two orders of magnitude slower midway through; this has happened in every long simulation I’ve run. Example command:

gmx mdrun -s test.tpr -v -x test.xtc -c test.gro -nb gpu -bonded gpu -pme gpu

Questions:
Do you have any advice to diagnose this problem? My first hypothesis is that perhaps my GPU is thermal throttling (at 86C) and being turned off midway through… though when I check GPU temp during the beginning of the run it runs at about 60-70C. If you also endorse this temperature hypothesis as the most plausible, is there a good/easy way to track temperature over time in my GPU?

Thanks for the help!

Robert_Crovella · March 27, 2024, 4:14am

One possible approach:

while gromacs is running, in a separate console, try

nvidia-smi -l --query-gpu=temperature.gpu --format=csv

then use your linux shell skills to put that into a file.

The nvidia-smi tool has a lot of capability; you can start to learn more about it by asking for command-line help with -h.

Some sub-menus have help also, e.g. --help-query-gpu

You could also write a shell script that runs that command in a loop without the -l switch (so that it prints only once per call), alternating it with date or similar to timestamp each reading.
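
A minimal sketch of the loop described above, assuming a single GPU and a POSIX-style shell. The function name log_gpu, the QUERY_CMD variable, and the output file name are hypothetical; the extra fields (power.draw, clocks.sm, utilization.gpu) are added here only because they help tell throttling apart from a workload change.

```shell
#!/bin/sh
# Query run once per sample (no -l switch). Overridable so the loop can be
# tried on a machine without a GPU.
QUERY_CMD=${QUERY_CMD:-"nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.sm,utilization.gpu --format=csv,noheader"}

# log_gpu OUTFILE INTERVAL_SECONDS SAMPLES
# Writes one timestamped line per sample; SAMPLES=0 means run until interrupted.
log_gpu() {
    log_out=$1
    log_interval=$2
    log_samples=$3
    log_n=0
    echo "timestamp, temperature.gpu, power.draw, clocks.sm, utilization.gpu" > "$log_out"
    while [ "$log_samples" -eq 0 ] || [ "$log_n" -lt "$log_samples" ]; do
        # Prefix each reading with the wall-clock time from date.
        printf '%s, %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$($QUERY_CMD)" >> "$log_out"
        log_n=$((log_n + 1))
        sleep "$log_interval"
    done
}

# Example: sample every 10 seconds in the background while mdrun is running.
#   log_gpu gpu_log.csv 10 0 &
```

Comparing the timestamps in the log against the point where the step rate drops should show whether temperature, power, or clocks change at that moment.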

evanc1 · March 27, 2024, 4:24am

Thanks for your help. I’ll work to troubleshoot the temperature. In the case that the problem is thermal throttling, do you have advice on good cooling solutions for the RTX 4090? This computer is a prebuilt Dell XPS 8960. Thanks again!

Robert_Crovella · March 27, 2024, 4:30am

I don’t really have advice, per se.

If it were my computer, and all I cared about was thermal throttling, I would open the case and point a nice big box fan at the exposed motherboard.

I don’t really know if that is the problem or not. Another theoretical possibility is power throttling (nvidia-smi can also help with that, although I’m not sure about GeForce GPUs). Power throttling seems a little less likely to me than thermal, but I don’t really know, just guessing.
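
One way to check for throttling directly is to query the clock throttle reasons, sketched below. This assumes the clocks_throttle_reasons.* query fields are available for your driver and GPU (newer drivers may list them under a different name in --help-query-gpu, and GeForce support is not guaranteed). The function name active_throttle_reasons and the THROTTLE_CMD variable are hypothetical.

```shell
#!/bin/sh
# Fields to query; each is reported as "Active" or "Not Active".
REASONS="clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.hw_thermal_slowdown"

# Overridable so the parsing can be tried without a GPU.
THROTTLE_CMD=${THROTTLE_CMD:-"nvidia-smi --query-gpu=$REASONS --format=csv,noheader"}

# Print the name of every throttle reason currently reported as Active.
# Prints nothing if no queried reason is active.
active_throttle_reasons() {
    $THROTTLE_CMD | awk -F', *' -v names="$REASONS" '
        {
            n = split(names, name, ",")
            for (i = 1; i <= n; i++)
                if ($i == "Active") print name[i]
        }'
}

# Example: run this once the simulation has slowed down.
#   active_throttle_reasons
```

An empty result while the run is slow would point away from power or thermal throttling and toward the application itself.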

evanc1 · March 27, 2024, 4:33am

All good stuff to think about. Thanks, Robert. I’ll revisit this thread in the future if I have any revelations in the troubleshooting process, in case someone down the road can benefit from the solution to this.

njuffa · March 27, 2024, 4:36am

While it is definitely worthwhile to look into the overheating hypothesis, the long delay until problem onset would seem to make that rather unlikely, provided the GPU load is applied approximately evenly over the length of a full simulation run.

I do not know what kind of enclosure this machine uses. In a typical workstation configuration, one would observe an actively cooled GPU heating up fairly rapidly during the first five minutes under full compute load. Further heating occurs at a very slow pace as the rest of the system warms up. Thermally, steady-state should be reached after about 10 minutes.

I have not run GROMACS myself. It is conceivable that it applies computational load in multiple distinct phases, one or several of which are particularly power-hungry, and that this happens after 8-16 hours.

If you do find an overheating problem, make sure airflow in the system enclosure is not impeded by obstacles (like neighboring PCIe cards), and that all heat sinks and fans are free of adhering dust. Even in an environment like my home office, which is not particularly dust-laden, I need to blow away accumulated dust about once a year. One can use compressed air for this (the stuff from a can, not from a compressor used to inflate tires). FWIW, this problem affects CPU cooling just as it does GPU cooling.

You could also ask about the observed issue in the GROMACS forum, as forum participants have plenty of experience with running GROMACS including all the problems commonly encountered when running the application.

I have been using Dell machines for 20+ years now. I am not familiar with the Dell XPS 8960 in particular, but my general experience is that Dell’s case cooling works well. My current machine has two beefy fans for that purpose. The only time they really kick into audible action is when I request a pre-boot hardware check from the BIOS after a restart.

evanc1 · March 31, 2024, 5:40pm

I resolved this issue.

Here’s the solution:

I edited my mdrun command to include -update gpu. Thus, the full command is as follows:

gmx mdrun -s test.tpr -v -x test.xtc -c test.gro -nb gpu -bonded gpu -pme gpu -update gpu

This aligns with the suggestions from this NVIDIA blog.

I initially did not include the -update gpu argument because I was using the Nose-Hoover thermostat, which for whatever reason does not work with the -update gpu argument. Because I was not committed to Nose-Hoover for any particular reason besides aligning with prior studies, I have opted to use the v-rescale thermostat in my .mdp file (as suggested here). v-rescale works with the -update gpu argument in mdrun.
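
For anyone making the same change, here is a rough sketch of switching the thermostat setting in an .mdp file from the shell. The function name use_v_rescale and all file names are hypothetical, and it assumes the thermostat is set on a single tcoupl line; other coupling parameters (such as tau_t) are left untouched and may need review.

```shell
#!/bin/sh
# use_v_rescale INPUT_MDP OUTPUT_MDP
# Copies INPUT_MDP to OUTPUT_MDP with the value on the tcoupl line replaced
# by v-rescale. Trailing comments and all other lines are kept as they are.
use_v_rescale() {
    sed -E 's/^([[:space:]]*[Tt]coupl[[:space:]]*=[[:space:]]*)[^;[:space:]]+/\1v-rescale/' "$1" > "$2"
}

# Example (file names are placeholders for your own inputs):
#   use_v_rescale test.mdp test_vrescale.mdp
#   gmx grompp -f test_vrescale.mdp -c start.gro -p topol.top -o test.tpr
#   gmx mdrun -s test.tpr -v -x test.xtc -c test.gro -nb gpu -bonded gpu -pme gpu -update gpu
```

The .tpr file has to be regenerated with grompp after editing the .mdp, since mdrun reads the thermostat choice from the .tpr.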

The simulation now runs in the expected time, i.e., ~30 hours, with GPU usage throughout. So matter resolved.

(If curious, you can visit my post in the GROMACS forum to see images of my GPU temperature and power draw over time.)

Thanks for your help.
