cudaFuncGetAttributes failed: out of memory

I’m afraid I’ve run into something rather strange. I’ve been using GROMACS on a GPU server and the performance was quite good. However, a few days ago the following fatal error suddenly started appearing:


Program: gmx mdrun, version 2019.4
Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 100)

Fatal error:
cudaFuncGetAttributes failed: out of memory

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

I can run other GPU applications, and the other GROMACS modules still work, but I can no longer run GROMACS on the GPU. Sorry for posting this problem here, but it looks more like something wrong with CUDA on the server (is GROMACS being denied access?), since I’ve reinstalled GROMACS and still get the same error.

Please help me solve this problem! (Unfortunately, I do not have permission to reboot the server.) The GPU information is below:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01      Driver Version: 440.33.01      CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   30C    P0    32W / 250W |  16008MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P0    27W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   30C    P0    32W / 250W |  16063MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   32C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Quadro P4000        On   | 00000000:0B:00.0 Off |                  N/A |
| 46%   23C    P8     8W / 105W |     12MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     20497      C   /usr/bin/python3                            5861MiB |
|    0     24503      C   /usr/bin/python3                           10137MiB |
|    2     23162      C   /home/appuser/Miniconda3/bin/python        16049MiB |
+-----------------------------------------------------------------------------+

Given that this problem has persisted for a few days already, have you tried requesting advice on the GROMACS mailing list?

I don’t know GROMACS’s GPU usage model in a multi-GPU environment. The output of nvidia-smi shows that pretty much all of the GPU memory on GPUs 0 and 2 has been grabbed by other compute processes, and a straightforward assumption would be that this leaves insufficient memory to run GROMACS on them. If these are zombie processes, you would want to kill them. If these are legitimate active processes, I would try restricting GROMACS to GPUs 1 and 3, which appear to be unused.
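To make the memory situation concrete, here is a minimal standalone sketch (not GROMACS code, just a hypothetical diagnostic I would try) that queries the free memory on each visible device via the CUDA runtime API. On a device that is already nearly full, even the implicit context creation behind the first runtime call can fail with an out-of-memory error, which would be consistent with GROMACS failing on cudaFuncGetAttributes.

```cpp
// free_mem_check.cu -- hypothetical standalone check, not part of GROMACS.
// Prints free/total memory for every visible GPU. On a device whose memory
// is nearly exhausted, creating a context for the query can itself fail
// with an out-of-memory error, similar to what GROMACS reports.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        size_t freeBytes = 0, totalBytes = 0;
        err = cudaMemGetInfo(&freeBytes, &totalBytes);  // creates a context on this device
        if (err != cudaSuccess) {
            std::printf("GPU %d: %s\n", dev, cudaGetErrorString(err));
            cudaGetLastError();  // clear the error state before trying the next device
            continue;
        }
        std::printf("GPU %d: %zu MiB free of %zu MiB\n",
                    dev, freeBytes >> 20, totalBytes >> 20);
    }
    return 0;
}
```

Building it with nvcc (e.g. nvcc -o free_mem_check free_mem_check.cu) and running it would show whether GPUs 0 and 2 even allow a new context, or whether only GPUs 1, 3, and 4 have usable headroom.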

I’ve tried restricting my tasks to the other GPUs but still get the same error. I’ve also sent an email to the GROMACS mailing list but have received no replies yet. As you mentioned, maybe it’s related to GROMACS’s GPU usage model in a multi-GPU environment. Still asking for help…

I think I’ve temporarily solved this problem. Only when I use CUDA_VISIBLE_DEVICES to hide GPUs 0 and 2 can I run GROMACS smoothly. There may be a bug in GROMACS’s GPU usage model in a multi-GPU environment, as njuffa mentioned.
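For what it’s worth, the effect of that workaround can be illustrated with another small, hypothetical enumerator (again not GROMACS code): with CUDA_VISIBLE_DEVICES=1,3 the runtime only ever sees the two idle P100s, re-indexed as devices 0 and 1, so no context is opened on the full GPUs at all.

```cpp
// list_visible.cu -- hypothetical sketch: print the devices this process can see.
// Run under CUDA_VISIBLE_DEVICES=1,3 only the two idle P100s are enumerated,
// re-indexed as devices 0 and 1.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no usable CUDA devices are visible\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;  // skip devices we cannot query
        }
        std::printf("visible device %d: %s (PCI bus %02x, device %02x)\n",
                    dev, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

Launching GROMACS the same way (e.g. CUDA_VISIBLE_DEVICES=1,3 gmx mdrun …, with whatever mdrun options your run needs) keeps the full GPUs out of its view entirely.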