Reset dedicated GPU after it gets stuck

I have a setup with a dedicated NVIDIA GPU for CUDA computing and another NVIDIA GPU for display. From time to time, the CUDA-dedicated card gets stuck - the fan keeps running at 100% and nothing can really be done with it anymore. I can “solve” it by rebooting, but I’d much prefer to solve it by resetting the CUDA-dedicated card.

Here is the nvidia-smi output when the issue happens (the CUDA card is #0 but #1 is set as primary):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …     Off | 00000000:41:00.0 Off |                  N/A |
|ERR!   41C    P2   ERR! / 250W |      2MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce …     Off | 00000000:42:00.0  On |                  N/A |
| 40%   47C    P8    10W /  75W |    479MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      4517      G   /usr/libexec/Xorg                 199MiB |
|    1   N/A  N/A      4861      G   /usr/bin/kwin_x11                   1MiB |
|    1   N/A  N/A      5515      G   …akonadi_archivemail_agent          1MiB |
|    1   N/A  N/A      5523      G   …/akonadi_mailfilter_agent         17MiB |
|    1   N/A  N/A      5526      G   …n/akonadi_sendlater_agent          1MiB |
|    1   N/A  N/A      5527      G   …nadi_unifiedmailbox_agent          1MiB |
|    1   N/A  N/A   1923084      G   /usr/bin/plasmashell               71MiB |
|    1   N/A  N/A   3052928      G   …449555690282580582,131072        123MiB |
+-----------------------------------------------------------------------------+

Thus, no processes are reported as running on the CUDA card. Yet, trying to reset the card returns:

nvidia-smi --gpu-reset -i 0
GPU 00000000:41:00.0 is currently in use by another process.

1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.

nvidia-persistenced is not running (so that is not the blocking process).
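
For the record, this can be double-checked with something like the following (the systemctl variant assumes the distro ships a systemd unit for the daemon):

pgrep -a nvidia-persistenced            # prints nothing when the daemon is not running
systemctl status nvidia-persistenced    # shows the unit state, if such a unit exists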

Is there a way to reset the card without reboot?

I have found that I can reset the card after killing Xorg (actually I used ‘systemctl isolate multi-user.target’). So, obviously, Xorg still somehow interferes with the dedicated card even though no processes are listed as running on that GPU.

After the Xorg stop and card reset, I am able to use the card as usual. However, I am still hoping there is a solution that would not require an Xorg restart - that’s why I have the setup with a CUDA-dedicated GPU in the first place…
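
For reference, the workaround sequence currently looks roughly like this (assuming the graphical session is managed through systemd targets, as it is here):

sudo systemctl isolate multi-user.target   # stop the graphical session so Xorg releases the NVIDIA devices
sudo nvidia-smi --gpu-reset -i 0           # the reset now succeeds
sudo systemctl isolate graphical.target    # bring the graphical session back up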

Please check if this helps:
Check whether

sudo cat /sys/module/nvidia_drm/parameters/modeset

returns ‘Y’. If so, run

grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*

to find a file containing

options nvidia-drm modeset=1

and change the 1 to 0, then run

sudo update-initramfs -u

and reboot.

sudo cat /sys/module/nvidia_drm/parameters/modeset

should return ‘N’ if done right.
Furthermore, you should monitor GPU temperatures and correctly set up nvidia-persistenced in order to prevent running into the error state.
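
For example, something along these lines (the exact service name may differ between distributions):

# log temperature, fan speed and power draw for both GPUs every 10 seconds
nvidia-smi --query-gpu=index,temperature.gpu,fan.speed,power.draw --format=csv -l 10

# keep the driver initialized even when no client is connected
sudo systemctl enable --now nvidia-persistenced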

cat /sys/module/nvidia_drm/parameters/modeset
cat: /sys/module/nvidia_drm/parameters/modeset: No such file or directory

ls /sys/module/ | grep nvidia
nvidia
nvidia_modeset
nvidia_uvm

lsmod | grep nvidia
nvidia_uvm 1191936 0
nvidia_modeset 1163264 23
nvidia 39108608 1215 nvidia_uvm,nvidia_modeset
drm 630784 1 nvidia

find /sys -iname drm
/sys/kernel/btf/drm
/sys/kernel/tracing/events/drm
/sys/kernel/debug/tracing/events/drm
/sys/class/drm
/sys/module/drm

So I guess nvidia-drm is not the culprit.

I did not know that nvidia-persistenced could prevent running into the error state, though - I thought the gain was just performance-wise. I have started it; let’s see whether it actually helps.

The issue still happens even with nvidia-persistenced running. So there is still no way to either prevent the issue or to sort it out without an Xorg restart.

I have managed to stop every process except Xorg from accessing the GPUs:

  • lsof /dev/nvidia* returns nothing
  • fuser -v /dev/nvidia* returns just Xorg on /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl, /dev/nvidia-modeset
  • nvidia-smi only shows a single process - Xorg running on card 1

Yet, the reset is still not possible. So the problem seems to be that even though nvidia-smi does not show Xorg running on card 0, it is still somehow connected to it.
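
Consistent with the fuser output, the device nodes held by Xorg can also be inspected directly (a quick check, assuming a single Xorg process):

# list which NVIDIA device nodes the Xorg process keeps open;
# /dev/nvidia0 is expected to show up here as well, matching what fuser reports
sudo ls -l /proc/$(pidof Xorg)/fd | grep nvidia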

Is there really no way to properly dedicate an NVIDIA card to CUDA computing? Would I have to stop using at least one of the NVIDIA cards to achieve that? The current setup with two NVIDIA cards is almost unusable, as I have to reboot almost every day because of this issue.