Linux version: 5.10.15-gentoo
Driver version: 460.39
GPU: GeForce GTX 750 Ti
A simple test program that calls clGetPlatformIDs or cuInit fails in most cases.
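For anyone who wants to reproduce this without compiling anything, here is a minimal sketch of an equivalent check in Python. It is not the original test program (that one is built with gcc); it only assumes that the driver library is reachable as libcuda.so.1 and that cuInit takes a single unsigned flags argument and returns an integer status, where 0 means success. The function name try_cu_init is made up for this example.

```python
import ctypes


def try_cu_init(lib_name="libcuda.so.1"):
    """Load the driver library and call cuInit(0).

    Returns the integer status from cuInit (0 means CUDA_SUCCESS),
    or None if the library could not be loaded at all.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        # Library missing or not loadable for this process architecture.
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


if __name__ == "__main__":
    status = try_cu_init()
    if status is None:
        print("libcuda.so.1 could not be loaded")
    else:
        print("cuInit returned", status)
```

Running it under strace should show the same /dev/nvidia-uvm ioctl sequence as the compiled test, assuming the failure is not specific to the C binary.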
The same test compiled with gcc -m32 always works, and it looks like the 32-bit build doesn't even use UVM. If I'm lucky, I can run the 64-bit test several times in a row, but it always fails at some point and never recovers. Then I have to reload the nvidia-uvm module and try again, but even then the test can fail on the first run.
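To put a number on "several times in a row", a small loop like the one below could be used. It is only a sketch: first_failure is a hypothetical helper, and the command passed to it stands for whatever 64-bit test binary is being run.

```python
import subprocess


def first_failure(cmd, runs=20):
    """Run cmd up to `runs` times.

    Returns the 1-based index of the first run that exits with a
    nonzero status, or None if every run succeeded.
    """
    for i in range(1, runs + 1):
        if subprocess.run(cmd).returncode != 0:
            return i
    return None
```

For example, first_failure(["./cuinit_test"]) would report how many runs it took after a module reload before the first failure, with ./cuinit_test being a placeholder path.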
I did catch one successful launch with strace, and here is where the difference starts:
ok:
ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffdff753a60) = 0
ioctl(6, _IOC(_IOC_NONE, 0, 0x17, 0), 0x7ffdff753ae0) = 0
...
ko:
ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffec0ca2390) = 0
ioctl(6, _IOC(_IOC_NONE, 0, 0x18, 0), 0x7ffec0ca2400) = 0
munmap(0x7f590187a000, 659456) = 0
...
where fd 6 is:
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = 6
On success it then loads /usr/lib64/libcuda.so.1 and proceeds; on failure it starts closing everything.
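The comparison of the two traces can be automated with a short script. The sketch below assumes strace prints the unknown requests in the _IOC(dir, type, nr, size) form shown above; parse_ioctls and first_divergence are names invented for this example.

```python
import re

# Matches e.g.: ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffd...) = 0
IOCTL_RE = re.compile(r"ioctl\((\d+), _IOC\(([^,]+), (\w+), (\w+), (\w+)\)")


def parse_ioctls(trace, fd):
    """Return the request numbers (the nr field) of ioctls issued on fd."""
    numbers = []
    for line in trace.splitlines():
        m = IOCTL_RE.search(line)
        if m and int(m.group(1)) == fd:
            numbers.append(int(m.group(4), 0))
    return numbers


def first_divergence(a, b):
    """Return (index, a_value, b_value) at the first mismatch, else None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None
```

Fed with the two excerpts above for fd 6, first_divergence reports index 1, with request 0x17 in the good run and 0x18 in the failing one.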
Loading the module with
# modprobe nvidia_uvm uvm_debug_prints=1 uvm_enable_builtin_tests=1 uvm_debug_enable_push_desc=1
reveals the following lines in dmesg (trimmed):
uvm_channel.c:461 uvm_channel_check_errors[pid:3385] Detected a channel error, channel ID 2 (0x2) CE 0 GPU ID 1
uvm_global.c:410 uvm_global_set_fatal_error_impl[pid:3385] Encountered a global fatal error: Generic RC error [NV_ERR_RC_ERROR]
uvm_channel.c:652 init_channel[pid:3385] Channel init failed: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1
uvm_gpu.c:1130 init_gpu[pid:3385] Failed to initialize the channel manager: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1
Sometimes there is also an additional line after the first error:
uvm_channel.c:468 uvm_channel_check_errors[pid:3385] Channel error likely caused by push 'Init channel' started at uvm_channel.c:641 in init_channel()
The channel ID is 2 in most cases and sometimes 3, but I never saw a number less than 2. So I suspect the successful runs used channel ID 1 or so.
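To check the channel ID pattern over many attempts instead of by eye, the dmesg output can be tallied. This is a rough sketch that assumes the message wording stays exactly as in the lines quoted above; channel_id_counts is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the uvm_channel_check_errors message quoted above.
CHANNEL_RE = re.compile(r"Detected a channel error, channel ID (\d+)")


def channel_id_counts(dmesg_text):
    """Count how often each channel ID appears in channel error messages."""
    return Counter(int(m.group(1)) for m in CHANNEL_RE.finditer(dmesg_text))
```

Passing the captured dmesg text to channel_id_counts gives a count per channel ID, which would show whether an ID below 2 ever appears in a failing run.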
I tried hard to figure out whether this is a setup problem and changed various kernel options (for example, making sure CONFIG_NUMA=y is set), but no luck so far. The behavior is always the same.
Apologies for my English.