I have a setup with one NVIDIA GPU dedicated to CUDA computing and another NVIDIA GPU for display. From time to time, the CUDA-dedicated card gets stuck - the fan keeps running at 100% and nothing can really be done with it anymore. I can “solve” it by rebooting, but I’d much prefer to solve it by resetting just the CUDA-dedicated card.
Here is the nvidia-smi output when the issue happens (the CUDA card is #0 but #1 is set as primary):
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 1 N/A N/A 4517 G /usr/libexec/Xorg 199MiB |
| 1 N/A N/A 4861 G /usr/bin/kwin_x11 1MiB |
| 1 N/A N/A 5515 G …akonadi_archivemail_agent 1MiB |
| 1 N/A N/A 5523 G …/akonadi_mailfilter_agent 17MiB |
| 1 N/A N/A 5526 G …n/akonadi_sendlater_agent 1MiB |
| 1 N/A N/A 5527 G …nadi_unifiedmailbox_agent 1MiB |
| 1 N/A N/A 1923084 G /usr/bin/plasmashell 71MiB |
| 1 N/A N/A 3052928 G …449555690282580582,131072 123MiB |
+-----------------------------------------------------------------------------+
Thus, no processes are reported as running on the CUDA card. Yet, trying to reset the card returns:
nvidia-smi --gpu-reset -i 0
GPU 00000000:41:00.0 is currently in use by another process.
1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.
nvidia-persistenced is not running, so that is not the blocking process.
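For reference, this is roughly how I verified that the persistence daemon is not the culprit (standard systemd/pgrep checks, assuming the daemon would be installed as the usual nvidia-persistenced service):

# check the systemd unit, if the daemon is installed that way
systemctl status nvidia-persistenced

# or look for a running process by matching the full command line
pgrep -af nvidia-persistenced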
I have found that I can reset the card after killing Xorg (actually, I used 'systemctl isolate multi-user.target'). So, obviously, Xorg still somehow interferes with the dedicated card even though no processes are listed as running on that GPU.
After stopping Xorg and resetting the card, I am able to use it as usual. However, I am still hoping for a solution that does not require restarting Xorg - that’s why I have the setup with a dedicated CUDA GPU in the first place…
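For completeness, my current workaround looks roughly like this (run from a text console or over SSH; GPU index 0 is my CUDA card, adjust as needed):

# drop to a non-graphical target so Xorg releases the NVIDIA devices
sudo systemctl isolate multi-user.target

# reset the stuck CUDA card (index 0 on my system)
sudo nvidia-smi --gpu-reset -i 0

# bring the graphical session back
sudo systemctl isolate graphical.target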
should return ‘N’ if done right.
Furthermore, you should monitor GPU temperatures and correctly set up nvidia-persistenced in order to prevent running into the error state in the first place.
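On a systemd-based distribution, something along these lines should do it (exact unit and package names may differ on your system):

# enable and start the persistence daemon (preferred way to keep the driver initialized)
sudo systemctl enable --now nvidia-persistenced

# alternatively, enable persistence mode directly for the CUDA card
sudo nvidia-smi -i 0 -pm 1

# poll the temperature of the CUDA card every 5 seconds
nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader -l 5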
I did not know that nvidia-persistenced could prevent running into the error state, though - I thought the gain was just performance-wise. I have started it; let’s see whether it indeed helps.
I have managed to remove all processes but Xorg from accessing the GPU (the exact checks are summarized after this list):
lsof /dev/nvidia* returns nothing
fuser -v /dev/nvidia* returns just Xorg on /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl, /dev/nvidia-modeset
nvidia-smi only shows a single process - Xorg running on card 1
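In case it is useful, these are essentially the commands behind the checks above:

# list open file handles on the NVIDIA device nodes
lsof /dev/nvidia*

# show which processes hold the device nodes (here: only Xorg)
fuser -v /dev/nvidia*

# list compute/graphics processes known to the driver
nvidia-smi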
Yet, a reset is still not possible. So, the problem seems to be that even though nvidia-smi does not show Xorg running on card 0, Xorg is still somehow connected to it.
Is there really no way to properly dedicate an NVIDIA card to CUDA computing? Would I have to stop using at least one of the NVIDIA cards to achieve that? The current setup with two NVIDIA cards is almost unusable, as I have to reboot almost every day because of this issue.