Linux version: 5.10.15-gentoo
Driver version: 460.39
GPU: GeForce GTX 750 Ti
Running a simple test program that calls clGetPlatformIDs or cuInit fails in most cases.
The same test compiled with gcc -m32 always works, and it looks like the 32-bit build doesn't even use UVM. If I'm lucky, I can run the 64-bit test several times in a row, but it always fails at some point and never recovers. Then I have to reload the nvidia-uvm module and try again, but the test can fail even on the first run.
I did catch one successful launch with strace, and here is where the difference starts:
ok:
ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffdff753a60) = 0
ioctl(6, _IOC(_IOC_NONE, 0, 0x17, 0), 0x7ffdff753ae0) = 0
...
ko:
ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffec0ca2390) = 0
ioctl(6, _IOC(_IOC_NONE, 0, 0x18, 0), 0x7ffec0ca2400) = 0
munmap(0x7f590187a000, 659456) = 0
...
where 6 is:
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = 6
On success it then loads /usr/lib64/libcuda.so.1 and proceeds; on failure it starts closing everything.
Loading the module with debug output enabled (# modprobe nvidia_uvm uvm_debug_prints=1 uvm_enable_builtin_tests=1 uvm_debug_enable_push_desc=1) reveals the following lines in dmesg (cut):
uvm_channel.c:461 uvm_channel_check_errors[pid:3385] Detected a channel error, channel ID 2 (0x2) CE 0 GPU ID 1
uvm_global.c:410 uvm_global_set_fatal_error_impl[pid:3385] Encountered a global fatal error: Generic RC error [NV_ERR_RC_ERROR]
uvm_channel.c:652 init_channel[pid:3385] Channel init failed: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1
uvm_gpu.c:1130 init_gpu[pid:3385] Failed to initialize the channel manager: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1
Sometimes there is also an additional line after the first error:
uvm_channel.c:468 uvm_channel_check_errors[pid:3385] Channel error likely caused by push 'Init channel' started at uvm_channel.c:641 in init_channel()
The channel ID is 2 in most cases, sometimes 3, but I never saw a number less than 2. So I suspect the successful runs were with channel ID 1 or so.
I tried hard to figure out whether this is a setup problem and changed various kernel options (making sure CONFIG_NUMA=y is set, and so on), but no luck so far. The behavior is always the same.
Apologies for my English.