This is the same observation I posted in another thread about the disappearing OpenCL ICD on Ubuntu 22.04.
After some extended use on an Ubuntu 22.04 box, I noticed that CUDA programs also fail to find NVIDIA GPUs when this happens. Similar to what I saw with my OpenCL code, my CUDA code fails to list any NVIDIA GPU and gets a -999 "unknown error" from CUDA.
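The CUDA side of the failure can be reproduced with a few lines of host code that only enumerate devices, with no kernels involved. The snippet below is a minimal illustrative sketch, not code taken from mcx; when the GPU has disappeared, the cudaGetDeviceCount call itself fails with an "unknown error" instead of returning the device count.

/* minimal CUDA device-enumeration check (illustrative, not from the mcx source);
   compile with: nvcc cudacheck.cu -o cudacheck */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);

    if (err != cudaSuccess) {
        /* this is what happens in the broken state described below */
        printf("cudaGetDeviceCount failed: %s (%d)\n", cudaGetErrorString(err), (int)err);
        return 1;
    }
    printf("CUDA sees %d device(s)\n", count);
    return 0;
}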
I’ve been observing this behavior for a few months now. After a fresh reboot, the NVIDIA GPU usually functions correctly for a few days under both OpenCL and CUDA. However, after a few suspend+wakeup cycles, sometimes after 1-2 days, sometimes after 3-4 days, the NVIDIA GPU disappears from both CUDA and OpenCL (including the clinfo output), even though nvidia-smi can still list the device.
I observed this on an RTX 2060 and an RTX 4090, both running Ubuntu 22.04. The same driver versions (52x or 53x) work fine on older releases of Ubuntu (20.04 and 18.04), so it appears that something in Ubuntu 22.04 or kernel 5.15 does not play well with the NVIDIA drivers.
Using strace, I captured the system calls when the GPU is working versus when it is broken. The difference is that when the GPU disappears, reading/writing the char device /dev/char/504:0, which is a symbolic link to /dev/nvidia-uvm, gives an ENOENT (No such file or directory) error, and reading/writing /dev/nvidia-uvm directly gives an EIO (Input/output error).
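To test for this state without rerunning the full trace, a small stand-alone probe like the sketch below (written for this post, not part of mcx) can exercise the same two calls: it stats /dev/char/504:0 (the major:minor number seen on my box, which may differ elsewhere) and then tries to open /dev/nvidia-uvm, printing the errno of each. When the GPU has disappeared, the first should report ENOENT and the second EIO.

/* probe the two failure points seen in the strace log below;
   504:0 is the char-device number on this machine and may differ elsewhere */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;

    /* the symlink the driver library looks up (see the strace snippet below) */
    if (stat("/dev/char/504:0", &st) != 0)
        printf("stat /dev/char/504:0 failed: %s\n", strerror(errno));   /* ENOENT when broken */
    else
        printf("/dev/char/504:0 is present\n");

    /* opening the UVM device node directly */
    int fd = open("/dev/nvidia-uvm", O_RDWR);
    if (fd < 0)
        printf("open /dev/nvidia-uvm failed: %s\n", strerror(errno));   /* EIO when broken */
    else {
        printf("/dev/nvidia-uvm opened successfully\n");
        close(fd);
    }
    return 0;
}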
Here is a snippet of the strace log printed when trying to list CUDA devices using my program mcx:
strace ./mcx -L
....
openat(AT_FDCWD, "/proc/devices", O_RDONLY) = 4
newfstatat(4, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(4, "Character devices:\n 1 mem\n 4 /"..., 1024) = 864
close(4) = 0
stat("/dev/nvidia-uvm", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1f8, 0), ...}) = 0
stat("/dev/nvidia-uvm", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1f8, 0), ...}) = 0
unlink("/dev/char/504:0") = -1 ENOENT (No such file or directory)
symlink("../nvidia-uvm", "/dev/char/504:0") = -1 EACCES (Permission denied) <==== this line
stat("/dev/char/504:0", 0x7fffbc245760) = -1 ENOENT (No such file or directory) <==== this line
stat("/usr/bin/nvidia-modprobe", {st_mode=S_IFREG|S_ISUID|S_ISGID|0755, st_size=47320, ...}) = 0
geteuid() = 1000
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f6046eee2d0) = 194037
wait4(194037, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 194037
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=194037, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error) <==== this line
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR) = -1 EIO (Input/output error)
ioctl(-5, _IOC(_IOC_NONE, 0, 0x2, 0x3000), 0) = -1 EBADF (Bad file descriptor)
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7fffbc2458c0) = 0
close(3) = 0
munmap(0x7f60449e4000, 30281056) = 0
futex(0x7f6046ea10f0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
write(1, "\33[31m\n", 6
) = 6
write(1, "MCX ERROR(-999):unknown error in"..., 55MCX ERROR(-999):unknown error in unit mcx_core.cu:2352
) = 55
write(1, "\33[0m", 4) = 4
exit_group(-999) = ?
+++ exited with 25 +++
When the GPU is usable, /dev/char/504:0 usually exists and can be read normally.
My OpenCL device-listing code returned similar errors (/dev/char/504:0 disappears and /dev/nvidia-uvm gives an I/O error) when the GPU is not accessible.
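For completeness, the OpenCL side can be checked with a minimal enumeration loop like the sketch below (along the lines of what clinfo does, not the actual OpenCL code in mcx). When the GPU is gone, the NVIDIA platform either no longer shows up at all or reports no GPU devices.

/* minimal OpenCL platform/device enumeration (illustrative sketch);
   compile with: gcc oclcheck.c -o oclcheck -lOpenCL */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id plats[16];
    cl_uint nplat = 0;

    if (clGetPlatformIDs(16, plats, &nplat) != CL_SUCCESS || nplat == 0) {
        printf("no OpenCL platforms found\n");
        return 1;
    }
    for (cl_uint i = 0; i < nplat; i++) {
        char name[256] = "";
        cl_uint ndev = 0;
        clGetPlatformInfo(plats[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        /* only query the GPU device count; ndev stays 0 if none are found */
        cl_int err = clGetDeviceIDs(plats[i], CL_DEVICE_TYPE_GPU, 0, NULL, &ndev);
        printf("platform '%s': %u GPU device(s)%s\n", name, ndev,
               err == CL_DEVICE_NOT_FOUND ? " (CL_DEVICE_NOT_FOUND)" : "");
    }
    return 0;
}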
fangq@rainbird$ lsmod | grep nvidia
nvidia_uvm           1363968  2
nvidia_drm             69632  28
nvidia_modeset       1241088  19 nvidia_drm
nvidia              56311808  1112 nvidia_uvm,nvidia_modeset
drm_kms_helper        311296  4 amdgpu,nvidia_drm
drm                   622592  20 drm_kms_helper,amd_sched,amdttm,nvidia,amdgpu,nvidia_drm,amddrm_ttm_helper
i2c_nvidia_gpu         16384  0
##################################
fangq@rainbird$ ls -lt /dev/nvidia-uvm
crw-rw-rw- 1 root root 504, 0 Aug 2 00:57 /dev/nvidia-uvm
##################################
fangq@rainbird$ uname -a
Linux rainbird 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
##################################
fangq@rainbird$ nvidia-smi
Fri Aug 4 21:26:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 53C P0 22W / 90W | 823MiB / 6144MiB | 25% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1577 G /usr/lib/xorg/Xorg 589MiB |
...