/dev/nvidia-uvm IO error on Ubuntu 22.04, 520 to 535 driver versions

This is the same observation I posted in another thread about the disappearing OpenCL ICD on Ubuntu 22.04.

After extended use of an Ubuntu 22.04 box, I noticed that CUDA programs also fail to find NVIDIA GPUs when this happens. As with my OpenCL code, my CUDA code fails to list the NVIDIA GPU and gets a -999 "unknown error" from CUDA.

I’ve been observing this behavior for a few months now. After a fresh reboot, the NVIDIA GPU usually works correctly for a few days under both OpenCL and CUDA. However, after a few suspend+wakeup cycles, sometimes after 1-2 days and sometimes after 3-4 days, the NVIDIA GPU disappears from both CUDA and OpenCL (including the clinfo output), even though nvidia-smi can still list the device.

I observed this on an RTX 2060 and an RTX 4090, both on Ubuntu 22.04. The same driver versions (52x or 53x) work fine on older versions of Ubuntu (20.04 and 18.04). So it seems that something in Ubuntu 22.04 or kernel 5.15 has an issue with the NVIDIA drivers.

Using strace, I captured the system calls when the GPU is working and when it is broken. The difference is that when the GPU disappears, accessing the char device /dev/char/504:0, which is a symbolic link to /dev/nvidia-uvm, gives an “ENOENT (No such file or directory)” error, and opening /dev/nvidia-uvm directly for read/write gives an “EIO (Input/output error)” error.
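For readers wondering where the 504 in /dev/char/504:0 comes from: it is the character-device major number of nvidia-uvm on this machine (0x1f8 in the st_rdev field of the trace is 504 in decimal). A minimal Python sketch of that lookup follows; it assumes the driver is listed under the name "nvidia-uvm" in /proc/devices, and the helper names char_major and char_alias are made up for illustration. The major number can differ between machines and boots.

```python
import os


def char_major(devices_text, name):
    """Return the major number listed for `name` under 'Character devices:', or None."""
    in_char = False
    for line in devices_text.splitlines():
        stripped = line.strip()
        if stripped == "Character devices:":
            in_char = True
            continue
        if stripped == "Block devices:":
            in_char = False
            continue
        parts = stripped.split()
        if in_char and len(parts) == 2 and parts[1] == name:
            return int(parts[0])
    return None


def char_alias(major, minor=0):
    """Build the /dev/char/MAJOR:MINOR alias path for a character device."""
    return f"/dev/char/{major}:{minor}"


if __name__ == "__main__":
    try:
        with open("/proc/devices") as f:
            major = char_major(f.read(), "nvidia-uvm")  # assumed entry name
    except OSError:
        major = None
    if major is None:
        print("nvidia-uvm is not listed in /proc/devices")
    else:
        print("expected alias:", char_alias(major))
        try:
            # Cross-check against the device node itself, as the trace does with stat().
            rdev = os.stat("/dev/nvidia-uvm").st_rdev
            print("device node is", f"{os.major(rdev)}:{os.minor(rdev)}")
        except OSError as e:
            print("stat /dev/nvidia-uvm failed:", e)
```

On the machine described here this should report /dev/char/504:0 as the expected alias, matching the path that goes missing.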

Here is a snippet of the strace log printed when trying to list CUDA devices with my program, mcx:

strace ./mcx -L
....
openat(AT_FDCWD, "/proc/devices", O_RDONLY) = 4
newfstatat(4, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(4, "Character devices:\n  1 mem\n  4 /"..., 1024) = 864
close(4)                                = 0
stat("/dev/nvidia-uvm", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1f8, 0), ...}) = 0
stat("/dev/nvidia-uvm", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1f8, 0), ...}) = 0
unlink("/dev/char/504:0")               = -1 ENOENT (No such file or directory)
symlink("../nvidia-uvm", "/dev/char/504:0") = -1 EACCES (Permission denied)     <==== this line
stat("/dev/char/504:0", 0x7fffbc245760) = -1 ENOENT (No such file or directory) <==== this line
stat("/usr/bin/nvidia-modprobe", {st_mode=S_IFREG|S_ISUID|S_ISGID|0755, st_size=47320, ...}) = 0
geteuid()                               = 1000
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f6046eee2d0) = 194037
wait4(194037, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 194037
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=194037, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)  <==== this line
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR) = -1 EIO (Input/output error)
ioctl(-5, _IOC(_IOC_NONE, 0, 0x2, 0x3000), 0) = -1 EBADF (Bad file descriptor)
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7fffbc2458c0) = 0
close(3)                                = 0
munmap(0x7f60449e4000, 30281056)        = 0
futex(0x7f6046ea10f0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
write(1, "\33[31m\n", 6
)                = 6
write(1, "MCX ERROR(-999):unknown error in"..., 55MCX ERROR(-999):unknown error in unit mcx_core.cu:2352
) = 55
write(1, "\33[0m", 4)                   = 4
exit_group(-999)                        = ?
+++ exited with 25 +++
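A side note on the last two lines of the trace, which may look inconsistent: the program calls exit_group(-999) but strace reports "exited with 25". On Linux only the low 8 bits of the exit value reach the parent process, so both lines describe the same exit. A one-function Python sketch of that arithmetic (the function name is hypothetical):

```python
def exit_status_seen_by_parent(code):
    """Keep only the low 8 bits of an exit code, as the parent process sees it."""
    return code & 0xFF


print(exit_status_seen_by_parent(-999))  # prints 25
```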

When the GPU is usable, /dev/char/504:0 usually exists and can be read normally.

My OpenCL device-listing code shows the same errors (/dev/char/504:0 missing and an I/O error on /dev/nvidia-uvm) when the GPU is not accessible.
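To check for the broken state without running a CUDA or OpenCL program, the two failing calls from the trace can be repeated directly. The following is a rough Python sketch, not a supported diagnostic: the helper names are made up, the 504:0 path is specific to this machine, and it assumes that simply opening and closing /dev/nvidia-uvm has no side effects.

```python
import errno
import os


def probe_open(path, flags=os.O_RDWR):
    """Try to open `path`; return "ok" or the symbolic errno name (e.g. "EIO")."""
    try:
        fd = os.open(path, flags)
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    os.close(fd)
    return "ok"


def probe_stat(path):
    """Try to stat `path`; return "ok" or the symbolic errno name (e.g. "ENOENT")."""
    try:
        os.stat(path)
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    return "ok"


if __name__ == "__main__":
    # In the failing state described above, the expected results are
    # EIO for the open and ENOENT for the stat.
    print("open /dev/nvidia-uvm:", probe_open("/dev/nvidia-uvm"))
    print("stat /dev/char/504:0:", probe_stat("/dev/char/504:0"))
```

Running this from a cron job or after each resume would show when the device flips from "ok" to EIO, which may help correlate the failure with suspend+wakeup cycles.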

fangq@rainbird$ lsmod | grep nvidia
nvidia_uvm           1363968  2
nvidia_drm             69632  28
nvidia_modeset       1241088  19 nvidia_drm
nvidia              56311808  1112 nvidia_uvm,nvidia_modeset
drm_kms_helper        311296  4 amdgpu,nvidia_drm
drm                   622592  20 drm_kms_helper,amd_sched,amdttm,nvidia,amdgpu,nvidia_drm,amddrm_ttm_helper
i2c_nvidia_gpu         16384  0
##################################
fangq@rainbird$ ls -lt /dev/nvidia-uvm
crw-rw-rw- 1 root root 504, 0 Aug  2 00:57 /dev/nvidia-uvm
##################################
fangq@rainbird$ uname -a
Linux rainbird 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
##################################
fangq@rainbird$ nvidia-smi
Fri Aug  4 21:26:17 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   53C    P0    22W /  90W |    823MiB /  6144MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1577      G   /usr/lib/xorg/Xorg                589MiB |
...

Through Google, I found a similar report, although in my case the nvidia-uvm kernel module is listed and the command below runs without error.

sudo modprobe nvidia-uvm

Here is the modinfo output:

fangq@rainbird$ sudo modinfo nvidia-uvm
filename:       /lib/modules/5.15.0-78-generic/updates/dkms/nvidia-uvm.ko
version:        525.125.06
supported:      external
license:        Dual MIT/GPL
srcversion:     00103E43435579A9611F585
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       5.15.0-78-generic SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         invocation Secure Boot Module Signature key
sig_key:        15:10:5F:46:1B:A6:D8:30:36:D0:FD:0D:64:A5:AF:A4:7C:13:7F:08
sig_hashalgo:   sha512
signature:      00:7D:ED:BC:15:08:30:BC:93:C9:6C:59:92:E2:52:10:43:08:75:5F:
		33:95:88:1B:B1:06:E6:26:4B:6C:75:77:4B:AE:26:A5:BC:28:DA:80:
		57:AC:EA:3F:9B:A5:67:FC:4C:74:43:7B:36:83:AB:62:E9:56:06:47:
		E5:69:C2:6D:69:40:2B:AC:8C:C2:7A:AF:D2:9F:63:7D:E4:96:C6:F1:
		E6:4F:BB:92:5A:88:A0:AC:70:FD:50:24:BA:DA:90:1D:98:66:A1:C4:
		57:64:9F:BA:30:4A:DF:57:C2:48:78:E9:F2:AC:A4:FB:A9:88:C3:2D:
		4D:76:70:6E:5A:3E:60:DF:C9:C9:79:0C:66:98:B6:97:A9:44:BD:FE:
		A7:97:85:2E:10:E7:E9:38:4B:C8:AD:EC:58:91:5C:72:E1:26:6E:C0:
		00:89:66:14:D3:C9:6D:8D:45:8F:97:65:A7:86:8C:07:D6:4A:FC:C5:
		BA:B8:9B:CF:17:54:28:6D:B0:28:29:D6:49:B2:83:79:77:FE:DC:FA:
		16:5D:4B:76:17:C3:53:F2:64:F1:D2:7F:03:A6:DF:7C:57:06:30:85:
		F0:A0:F9:48:95:E4:74:4A:63:11:D4:E9:62:EB:F2:6E:FD:20:6B:D3:
		BD:17:50:9E:B0:48:77:84:E2:A9:A8:2E:BF:E3:80:9C
parm:           uvm_exp_perf_prefetch_ats_order_replayable:Max order of pages (2^N) to prefetch on replayable ATS faults (uint)
parm:           uvm_exp_perf_prefetch_ats_order_non_replayable:Max order of pages (2^N) to prefetch on non-replayable ATS faults (uint)
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (i.e, HMM is potentially enabled). Ignored if HMM is not supported in the driver, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm:           uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm:           uvm_debug_prints:Enable uvm debug prints. (int)
parm:           uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
parm:           uvm_release_asserts:Enable uvm asserts included in release builds. (int)
parm:           uvm_release_asserts_dump_stack:dump_stack() on failed UVM release asserts. (int)
parm:           uvm_release_asserts_set_global_error:Set UVM global fatal error on failed release asserts. (int)

It looks like many people have reported similar issues before, for drivers as old as 460, so this seems to be a known bug in nvidia-uvm. Are there any workarounds? This happens quite frequently now, and I really hate having to reboot my machine.