Molecular dynamics simulations in GROMACS with CUDA slow down midway through

Context:
I use GROMACS 2023.3 with CUDA 12.3 on my computer with a single RTX 4090.
I simulate lipid bilayer assembly processes for a few hundred nanoseconds; the tasks are relatively computationally demanding.

Problem:
When I run the simulation (with the “gmx mdrun” command; see example below), the initial estimated time to completion is about 1-2 days, which is reasonable. My computer completes about 1000 timesteps in 1-2 seconds. However, I’ve noticed that after about 8-16 hours of running, the run slows down significantly, with only 100 steps being processed every 10 seconds or so; the estimated time to completion increases to 7+ days. In short, my runs become roughly two orders of magnitude slower midway through; this has happened for each long simulation I’ve run. Example command:

gmx mdrun -s test.tpr -v -x test.xtc -c test.gro -nb gpu -bonded gpu -pme gpu

Questions:
Do you have any advice to diagnose this problem? My first hypothesis is that perhaps my GPU is thermal throttling (at 86C) and being turned off midway through… though when I check GPU temp during the beginning of the run it runs at about 60-70C. If you also endorse this temperature hypothesis as the most plausible, is there a good/easy way to track temperature over time in my GPU?

Thanks for the help!

One possible approach:

While GROMACS is running, in a separate console, try:

nvidia-smi -l --query-gpu=temperature.gpu --format=csv

Then use your Linux shell skills to redirect that output into a file.

The nvidia-smi tool has a lot of capability; you can start to learn more about it by asking for command-line help with -h.

Some sub-menus have help also, e.g. --help-query-gpu

You could also write a shell script that runs that command without the -l switch (so that it prints only once) in a loop, alternating it with date or similar to timestamp each reading.
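A minimal sketch of such a script follows. The log file name, the 30-second interval, and the SMI, LOG, INTERVAL and SAMPLES variables are assumptions chosen for illustration. It also records power draw and SM clock alongside temperature, since a drop in clock speed is what throttling would look like in the log.

```shell
#!/bin/sh
# Log timestamped GPU readings to a CSV file while mdrun is running.
# All settings below are illustrative defaults; override them from the environment.
SMI="${SMI:-nvidia-smi}"      # command used to query the GPU
LOG="${LOG:-gpu_log.csv}"     # output file (hypothetical name)
INTERVAL="${INTERVAL:-30}"    # seconds between samples
SAMPLES="${SAMPLES:-0}"       # number of samples to take; 0 means run until Ctrl+C

log_gpu() {
    n=0
    # Start a fresh log with a header line.
    echo "time, temperature.gpu, power.draw, clocks.sm" > "$LOG"
    while [ "$SAMPLES" -eq 0 ] || [ "$n" -lt "$SAMPLES" ]; do
        # One reading per call: no -l switch, so nvidia-smi prints once and exits.
        reading=$("$SMI" --query-gpu=temperature.gpu,power.draw,clocks.sm --format=csv,noheader)
        # Prefix the reading with the wall-clock time from date.
        echo "$(date '+%Y-%m-%d %H:%M:%S'), $reading" >> "$LOG"
        n=$((n + 1))
        sleep "$INTERVAL"
    done
}
```

After sourcing the file, calling log_gpu in a second console before starting mdrun leaves a gpu_log.csv that can be compared against the time at which the slowdown sets in.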


Thanks for your help. I’ll work to troubleshoot the temperature. In the case that the problem is thermal throttling, do you have advice on good cooling solutions for the RTX 4090? This computer is a prebuilt Dell XPS 8960. Thanks again!

I don’t really have advice, per se.

If it were my computer, and all I cared about was thermal throttling I would open the case and point a nice big box fan at the open case motherboard.

I don’t really know if that is the problem or not. Another theoretical possibility is power throttling (nvidia-smi can also help with that, although I’m not sure about GeForce GPUs). Power throttling seems a little less likely to me than thermal, but I don’t really know; I’m just guessing.


All good stuff to think about. Thanks, Robert. I’ll revisit this thread in the future if I have any revelations in the troubleshooting process, in case someone down the road can benefit from the solution to this.

While it is definitely worthwhile to look into the overheating hypothesis, the long delay until problem onset would seem to make that rather unlikely, provided the GPU load is applied approximately evenly over the length of a full simulation run.

I do not know what kind of enclosure this machine uses. In a typical workstation configuration, one would observe an actively cooled GPU heating up fairly rapidly during the first five minutes under full compute load. Further heating occurs at a very slow pace as the rest of the system warms up. Thermally, steady state should be reached after about 10 minutes.

I have not run GROMACS myself. It is conceivable that it applies computational load in multiple distinct phases, one or several of which are particularly power-hungry, and that this happens after 6-8 hours.

If you do find an overheating problem, make sure airflow in the system enclosure is not impeded by obstacles (like neighboring PCIe cards), and that all heat sinks and fans are free of adhering dust. Even in environments that are not particularly dust-laden, like my home office, I need to blow away accumulated dust about once a year. One can use compressed air for this (the stuff from a can, not from a compressor used to inflate tires). FWIW, this problem affects CPU cooling just as it does GPU cooling.

You could also ask about the observed issue in the GROMACS forum, as forum participants have plenty of experience with running GROMACS including all the problems commonly encountered when running the application.

I have been using Dell machines for 20+ years now. I am not familiar with the Dell XPS 8960 in particular, but my general experience is that Dell’s case cooling works well. My current machine has two beefy fans for that purpose. The only time they really kick into audible action is when I request a pre-boot hardware check from the BIOS after a restart.


I resolved this issue.

Here’s the solution:

I edited my mdrun command to include -update gpu. Thus, the full command is as follows:

gmx mdrun -s test.tpr -v -x test.xtc -c test.gro -nb gpu -bonded gpu -pme gpu -update gpu -nstlist 400

This aligns with the suggestions from this NVIDIA blog.

I initially did not include the -update gpu argument because I was using the Nose-Hoover thermostat, which for whatever reason does not work with -update gpu. Because I was not committed to Nose-Hoover for any particular reason besides aligning with prior studies, I have opted to use the v-rescale thermostat in my .mdp file (as suggested here). v-rescale works with the -update gpu argument in mdrun.
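For anyone making the same change, here is a rough sketch of the thermostat switch. It assumes the .mdp file already has a tcoupl line written as tcoupl or Tcoupl. The use_v_rescale helper and the file names md.mdp, md_vrescale.mdp, start.gro and topol.top are hypothetical placeholders, not the files from my run.

```shell
# Print an .mdp file with the thermostat set to v-rescale.
# Only the value on an existing "tcoupl = ..." line is rewritten;
# trailing comments and every other line pass through unchanged.
use_v_rescale() {
    sed -E 's/^([[:space:]]*[Tt]coupl[[:space:]]*=[[:space:]]*)[^;[:space:]]+/\1v-rescale/' "$1"
}

# Example usage (hypothetical file names):
#   use_v_rescale md.mdp > md_vrescale.mdp
#   gmx grompp -f md_vrescale.mdp -c start.gro -p topol.top -o test.tpr
```

The .tpr file has to be regenerated with gmx grompp after the edit, since mdrun reads the thermostat choice from the .tpr rather than from the .mdp directly.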

The simulation now runs in the expected time, i.e., ~30 hours, with GPU usage throughout. So matter resolved.

(If curious, you can visit my post in the GROMACS forum to see images of my GPU temperature and power draw over time.)

Thanks for your help.
