/dev/nvidia-uvm IO error on Ubuntu 22.04, 520 to 535 driver versions

FangQ · August 5, 2023, 2:27am

This is the same observation I posted in another thread about the disappearing OpenCL ICD on Ubuntu 22.04.

After some extended use on an Ubuntu 22.04 box, I noticed that CUDA programs also fail to find NVIDIA GPUs when this happens. As with my OpenCL code, my CUDA code fails to list the NVIDIA GPU and gets a -999 “unknown error” from CUDA.

I’ve been observing this behavior for a few months now. After a fresh reboot, the NVIDIA GPU usually functions correctly for a few days under both OpenCL and CUDA. However, after a few suspend+wakeup cycles, sometimes after 1-2 days, sometimes after 3-4 days, the NVIDIA GPU disappears from both CUDA and OpenCL (including clinfo output), even though nvidia-smi can still list the device.
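To catch the moment the GPU drops out (for example right after a resume), the CUDA driver API can be polled directly instead of running a full application. The following is a minimal sketch, assuming libcuda.so.1 is on the loader path; the helper name cuda_device_count is hypothetical. cuInit and cuDeviceGetCount are standard driver API entry points, and 999 is CUDA_ERROR_UNKNOWN, which presumably corresponds to the -999 reported above.

```python
import ctypes


def cuda_device_count(libname="libcuda.so.1"):
    """Return (status, device_count) from the CUDA driver API.

    Returns None if the driver library cannot be loaded at all.
    device_count is None whenever a call fails.
    """
    try:
        cuda = ctypes.CDLL(libname)
    except OSError:
        return None  # driver library not installed or not on the loader path

    status = cuda.cuInit(0)  # 0 is CUDA_SUCCESS; 999 is CUDA_ERROR_UNKNOWN
    if status != 0:
        return status, None

    count = ctypes.c_int(0)
    status = cuda.cuDeviceGetCount(ctypes.byref(count))
    return status, (count.value if status == 0 else None)
```

Calling cuda_device_count() on a healthy system should give a status of 0 and a device count; a nonzero status while nvidia-smi still lists the GPU would match the state described here.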

I observed this on an RTX 2060 and an RTX 4090, both on Ubuntu 22.04. The same driver versions (52x or 53x) work fine on older versions of Ubuntu (20.04 and 18.04). So it seems that something in Ubuntu 22.04 or kernel 5.15 has an issue with the NVIDIA drivers.

Using strace, I captured the system calls when the GPU is working vs. when it is broken. The difference is that when the GPU disappears, accessing the char device /dev/char/504:0, which is a symbolic link to /dev/nvidia-uvm, gives an “ENOENT (No such file or directory)” error, and opening /dev/nvidia-uvm directly for reading and writing gives an EIO (Input/output error) error.

Here is a snippet of the log printed when trying to list CUDA devices using my program mcx:

strace ./mcx -L

....
openat(AT_FDCWD, "/proc/devices", O_RDONLY) = 4
newfstatat(4, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(4, "Character devices:\n  1 mem\n  4 /"..., 1024) = 864
close(4)                                = 0
stat("/dev/nvidia-uvm", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1f8, 0), ...}) = 0
stat("/dev/nvidia-uvm", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1f8, 0), ...}) = 0
unlink("/dev/char/504:0")               = -1 ENOENT (No such file or directory)
symlink("../nvidia-uvm", "/dev/char/504:0") = -1 EACCES (Permission denied)     <==== this line
stat("/dev/char/504:0", 0x7fffbc245760) = -1 ENOENT (No such file or directory) <==== this line
stat("/usr/bin/nvidia-modprobe", {st_mode=S_IFREG|S_ISUID|S_ISGID|0755, st_size=47320, ...}) = 0
geteuid()                               = 1000
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f6046eee2d0) = 194037
wait4(194037, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 194037
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=194037, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)  <==== this line
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR) = -1 EIO (Input/output error)
ioctl(-5, _IOC(_IOC_NONE, 0, 0x2, 0x3000), 0) = -1 EBADF (Bad file descriptor)
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7fffbc2458c0) = 0
close(3)                                = 0
munmap(0x7f60449e4000, 30281056)        = 0
futex(0x7f6046ea10f0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
write(1, "\33[31m\n", 6
)                = 6
write(1, "MCX ERROR(-999):unknown error in"..., 55MCX ERROR(-999):unknown error in unit mcx_core.cu:2352
) = 55
write(1, "\33[0m", 4)                   = 4
exit_group(-999)                        = ?
+++ exited with 25 +++
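A side note on the last two lines of the trace: the program calls exit_group(-999), but strace reports "exited with 25". On Linux only the low 8 bits of the exit status reach the parent process, and -999 modulo 256 is 25, so the two values are consistent. A one-line check (the function name is only illustrative):

```python
def status_seen_by_parent(exit_code):
    """Low 8 bits of the value passed to exit(), as reported to the parent."""
    return exit_code & 0xFF


print(status_seen_by_parent(-999))  # 25
```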

When the GPU is usable, /dev/char/504:0 usually exists and can be read normally.

My OpenCL device-listing code returned similar error messages (/dev/char/504:0 disappears and /dev/nvidia-uvm gives an IO error) when the GPU is not accessible.
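The same state can be checked without strace. In the trace, makedev(0x1f8, 0) is major 504, minor 0, which is where the /dev/char/504:0 name comes from. The sketch below stats the device node, derives the matching /dev/char entry, and tries to open the node read-write, recording the errno name on failure. The helper name check_uvm_node and the report keys are hypothetical, and it assumes that briefly opening and closing the node has no side effects.

```python
import errno
import os
import stat


def check_uvm_node(path="/dev/nvidia-uvm"):
    """Report the state of a device node and its /dev/char/MAJOR:MINOR link."""
    report = {"path": path}
    try:
        st = os.stat(path)
    except OSError as exc:
        report["stat_error"] = errno.errorcode.get(exc.errno, str(exc.errno))
        return report

    report["is_char_device"] = stat.S_ISCHR(st.st_mode)
    major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
    report["major_minor"] = f"{major}:{minor}"

    # The symlink that the strace log shows as missing in the broken state.
    link = f"/dev/char/{major}:{minor}"
    report["char_link"] = link
    report["char_link_exists"] = os.path.lexists(link)

    # Same flags as the first failing openat() in the log.
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError as exc:
        report["open_error"] = errno.errorcode.get(exc.errno, str(exc.errno))
    else:
        os.close(fd)
        report["open_error"] = None
    return report
```

In the broken state described above, one would expect char_link_exists to be False and open_error to be "EIO"; on a working system the open should succeed and open_error should be None.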

fangq@rainbird$ lsmod | grep nvidia
nvidia_uvm           1363968  2
nvidia_drm             69632  28
nvidia_modeset       1241088  19 nvidia_drm
nvidia              56311808  1112 nvidia_uvm,nvidia_modeset
drm_kms_helper        311296  4 amdgpu,nvidia_drm
drm                   622592  20 drm_kms_helper,amd_sched,amdttm,nvidia,amdgpu,nvidia_drm,amddrm_ttm_helper
i2c_nvidia_gpu         16384  0
##################################
fangq@rainbird$ ls -lt /dev/nvidia-uvm
crw-rw-rw- 1 root root 504, 0 Aug  2 00:57 /dev/nvidia-uvm
##################################
fangq@rainbird$ uname -a
Linux rainbird 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
##################################
fangq@rainbird$ nvidia-smi
Fri Aug  4 21:26:17 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   53C    P0    22W /  90W |    823MiB /  6144MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1577      G   /usr/lib/xorg/Xorg                589MiB |
...

FangQ · August 5, 2023, 2:29am

Through Google, I found a similar report, although in my case the nvidia-uvm kernel module can be listed and the command below runs without error.

sudo modprobe nvidia-uvm

Here is the modinfo output:

fangq@rainbird$ sudo modinfo nvidia-uvm
filename:       /lib/modules/5.15.0-78-generic/updates/dkms/nvidia-uvm.ko
version:        525.125.06
supported:      external
license:        Dual MIT/GPL
srcversion:     00103E43435579A9611F585
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       5.15.0-78-generic SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         invocation Secure Boot Module Signature key
sig_key:        15:10:5F:46:1B:A6:D8:30:36:D0:FD:0D:64:A5:AF:A4:7C:13:7F:08
sig_hashalgo:   sha512
signature:      00:7D:ED:BC:15:08:30:BC:93:C9:6C:59:92:E2:52:10:43:08:75:5F:
		33:95:88:1B:B1:06:E6:26:4B:6C:75:77:4B:AE:26:A5:BC:28:DA:80:
		57:AC:EA:3F:9B:A5:67:FC:4C:74:43:7B:36:83:AB:62:E9:56:06:47:
		E5:69:C2:6D:69:40:2B:AC:8C:C2:7A:AF:D2:9F:63:7D:E4:96:C6:F1:
		E6:4F:BB:92:5A:88:A0:AC:70:FD:50:24:BA:DA:90:1D:98:66:A1:C4:
		57:64:9F:BA:30:4A:DF:57:C2:48:78:E9:F2:AC:A4:FB:A9:88:C3:2D:
		4D:76:70:6E:5A:3E:60:DF:C9:C9:79:0C:66:98:B6:97:A9:44:BD:FE:
		A7:97:85:2E:10:E7:E9:38:4B:C8:AD:EC:58:91:5C:72:E1:26:6E:C0:
		00:89:66:14:D3:C9:6D:8D:45:8F:97:65:A7:86:8C:07:D6:4A:FC:C5:
		BA:B8:9B:CF:17:54:28:6D:B0:28:29:D6:49:B2:83:79:77:FE:DC:FA:
		16:5D:4B:76:17:C3:53:F2:64:F1:D2:7F:03:A6:DF:7C:57:06:30:85:
		F0:A0:F9:48:95:E4:74:4A:63:11:D4:E9:62:EB:F2:6E:FD:20:6B:D3:
		BD:17:50:9E:B0:48:77:84:E2:A9:A8:2E:BF:E3:80:9C
parm:           uvm_exp_perf_prefetch_ats_order_replayable:Max order of pages (2^N) to prefetch on replayable ATS faults (uint)
parm:           uvm_exp_perf_prefetch_ats_order_non_replayable:Max order of pages (2^N) to prefetch on non-replayable ATS faults (uint)
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (i.e, HMM is potentially enabled). Ignored if HMM is not supported in the driver, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm:           uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm:           uvm_debug_prints:Enable uvm debug prints. (int)
parm:           uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
parm:           uvm_release_asserts:Enable uvm asserts included in release builds. (int)
parm:           uvm_release_asserts_dump_stack:dump_stack() on failed UVM release asserts. (int)
parm:           uvm_release_asserts_set_global_error:Set UVM global fatal error on failed release asserts. (int)

FangQ · August 27, 2023, 2:28am

It looks like many people have reported similar issues before, for drivers as old as 460. This seems to be a known bug in nvidia-uvm. Are there workarounds? This happens quite frequently now, and I really hate to reboot my machine.
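Other reports of this bug describe removing and re-inserting the nvidia_uvm module after wakeup as a way to avoid a reboot. A minimal sketch of that sequence follows; it is untested on this machine, and the helper name reload_uvm is hypothetical. It defaults to a dry run that only returns the commands. The rmmod step is expected to fail while any process still holds the module (lsmod above shows a use count of 2), so those processes would need to exit first.

```python
import subprocess


def reload_uvm(dry_run=True):
    """Unload and reload nvidia_uvm; with dry_run=True only return the commands."""
    commands = [
        ["sudo", "rmmod", "nvidia_uvm"],     # fails if the module is still in use
        ["sudo", "modprobe", "nvidia_uvm"],
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands
```

Calling reload_uvm() shows what would be run; reload_uvm(dry_run=False) actually executes it and requires sudo rights.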
