CUDA or OpenCL: 32-bit OK, 64-bit KO. Why? (460.39, nvidia-uvm)

Linux version: 5.10.15-gentoo
Driver version: 460.39
GPU: GeForce GTX 750 Ti

A simple test program that calls clGetPlatformIDs or cuInit fails in most cases.

The same test compiled with gcc -m32 always works, and it looks like the 32-bit build doesn't even use UVM. If I'm lucky I can run the 64-bit test several times in a row, but it always fails at some point and never recovers on its own. I then have to reload the nvidia-uvm module and try again, and even then the test can fail on the very first run.
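For reference, a minimal sketch of such a test. The original test source is not shown in this thread, so this is a stand-in written in Python with ctypes rather than C. It assumes the driver library is installed under the name libcuda.so.1; the helper name try_cuinit is made up for illustration. A 64-bit Python interpreter loads the 64-bit library, so it should go through the same 64-bit path.

```python
import ctypes


def try_cuinit(libname="libcuda.so.1"):
    """Load the CUDA driver library and call cuInit(0).

    Returns (result_code, error_text). result_code is None when the
    library or the symbol could not be loaded; 0 means CUDA_SUCCESS.
    """
    try:
        lib = ctypes.CDLL(libname)
        cu_init = lib.cuInit
    except (OSError, AttributeError) as exc:
        return None, str(exc)
    # Driver API prototype: CUresult cuInit(unsigned int Flags); Flags must be 0.
    cu_init.argtypes = [ctypes.c_uint]
    cu_init.restype = ctypes.c_int
    return cu_init(0), None


if __name__ == "__main__":
    code, err = try_cuinit()
    if err is not None:
        print("could not load driver library:", err)
    else:
        print("cuInit returned", code)
```

Running it several times in a row (and again after reloading nvidia-uvm) would be one way to see how often the 64-bit path fails.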

I did catch one successful launch with strace, and here is where the two traces start to differ:


ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffdff753a60) = 0
ioctl(6, _IOC(_IOC_NONE, 0, 0x17, 0), 0x7ffdff753ae0) = 0


ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffec0ca2390) = 0
ioctl(6, _IOC(_IOC_NONE, 0, 0x18, 0), 0x7ffec0ca2400) = 0
munmap(0x7f590187a000, 659456)          = 0

where 6 is:

openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = 6

On success it then loads /usr/lib64/ and proceeds; on failure it starts closing everything.
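Finding the point where two strace logs diverge can be automated. The sketch below is only an illustration: it masks pointer-like hex values (which change on every run) and reports the first line that still differs. The function names and the six-digit threshold for "looks like an address" are assumptions, and run_a / run_b are simply the two excerpts quoted above.

```python
import re

# Long hex values (stack addresses, mmap bases) change between runs;
# short ones such as the ioctl number 0x25 are left untouched.
_ADDR = re.compile(r"0x[0-9a-f]{6,}")


def normalize(line):
    """Replace run-specific addresses with a fixed placeholder."""
    return _ADDR.sub("ADDR", line.strip())


def first_divergence(trace_a, trace_b):
    """Return (index, line_a, line_b) for the first differing line, or None."""
    a = [normalize(l) for l in trace_a if l.strip()]
    b = [normalize(l) for l in trace_b if l.strip()]
    for i, (la, lb) in enumerate(zip(a, b)):
        if la != lb:
            return i, la, lb
    if len(a) != len(b):
        i = min(len(a), len(b))
        return i, (a[i] if i < len(a) else None), (b[i] if i < len(b) else None)
    return None


run_a = [
    "ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffdff753a60) = 0",
    "ioctl(6, _IOC(_IOC_NONE, 0, 0x17, 0), 0x7ffdff753ae0) = 0",
]
run_b = [
    "ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffec0ca2390) = 0",
    "ioctl(6, _IOC(_IOC_NONE, 0, 0x18, 0), 0x7ffec0ca2400) = 0",
    "munmap(0x7f590187a000, 659456)          = 0",
]

if __name__ == "__main__":
    print(first_divergence(run_a, run_b))
```

With the excerpts above, the first difference is the second ioctl: request number 0x17 in one run and 0x18 in the other.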

Loading the module with debug options (# modprobe nvidia_uvm uvm_debug_prints=1 uvm_enable_builtin_tests=1 uvm_debug_enable_push_desc=1) reveals the following lines in dmesg (trimmed):

uvm_channel.c:461 uvm_channel_check_errors[pid:3385] Detected a channel error, channel ID 2 (0x2) CE 0 GPU ID 1
uvm_global.c:410 uvm_global_set_fatal_error_impl[pid:3385] Encountered a global fatal error: Generic RC error [NV_ERR_RC_ERROR]
uvm_channel.c:652 init_channel[pid:3385] Channel init failed: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1
uvm_gpu.c:1130 init_gpu[pid:3385] Failed to initialize the channel manager: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1

Sometimes there is also an additional line after the first error:

uvm_channel.c:468 uvm_channel_check_errors[pid:3385] Channel error likely caused by push 'Init channel' started at uvm_channel.c:641 in init_channel()

The channel ID is 2 in most cases and sometimes 3, but I never saw a number lower than 2. So I suspect the successful runs were on channel ID 1 or so.
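To tally which channel IDs show up across many failed runs, the dmesg lines can be counted with a small script. This is a rough sketch that assumes the message wording stays exactly as in the lines quoted above; channel_error_counts is a hypothetical helper name. It only counts failures, so it cannot confirm which channel the successful runs used.

```python
import re
from collections import Counter

# Matches the first error line printed by uvm_channel_check_errors.
_CHANNEL_ERR = re.compile(r"Detected a channel error, channel ID (\d+)")


def channel_error_counts(lines):
    """Count how often each channel ID appears in channel-error lines."""
    counts = Counter()
    for line in lines:
        match = _CHANNEL_ERR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Sample input: the dmesg lines quoted in this post.
sample = [
    "uvm_channel.c:461 uvm_channel_check_errors[pid:3385] Detected a channel error, channel ID 2 (0x2) CE 0 GPU ID 1",
    "uvm_global.c:410 uvm_global_set_fatal_error_impl[pid:3385] Encountered a global fatal error: Generic RC error [NV_ERR_RC_ERROR]",
    "uvm_channel.c:652 init_channel[pid:3385] Channel init failed: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1",
    "uvm_gpu.c:1130 init_gpu[pid:3385] Failed to initialize the channel manager: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1",
]

if __name__ == "__main__":
    for channel_id, count in sorted(channel_error_counts(sample).items()):
        print("channel ID", channel_id, "errors:", count)
```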

I tried hard to figure out whether this is a setup problem and changed various kernel options (for example, making sure CONFIG_NUMA=y is set), but no luck so far. The behavior is always the same.

Apologies for my English.

Have you already tried a 5.4 kernel?

No, but why not. So far I have tested only 5.10.{14,15}, but I guess the problem has been there for a while; I don't run CUDA/CL programs very often.
Thanks for the suggestion, I'll try a 5.4 kernel.

Okay, puzzle solved.
The nvidia module option InitializeSystemMemoryAllocations=0 breaks the nvidia-uvm module.

 * Option: InitializeSystemMemoryAllocations
 * Description:
 * The NVIDIA Linux driver normally clears system memory it allocates
 * for use with GPUs or within the driver stack. This is to ensure
 * that potentially sensitive data is not rendered accessible by
 * arbitrary user applications.
 * Owners of single-user systems or similar trusted configurations may
 * choose to disable the aforementioned clears using this option and
 * potentially improve performance.
 * Possible values:
 *  1 = zero out system memory allocations (default)
 *  0 = do not perform memory clears

Both the 5.4 and 5.10 kernels are KO with this option set to 0, and both are OK once it is reset to the default.
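For anyone who wants to check whether their system has this option turned off, here is a hedged sketch. It assumes the loaded driver exposes its current settings as "Name: value" lines in /proc/driver/nvidia/params; both the path and the line format are assumptions to verify on your own system, and the helper names are made up for illustration.

```python
PARAMS_PATH = "/proc/driver/nvidia/params"  # assumed location, verify locally


def parse_nvidia_params(text):
    """Parse 'Name: value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def memory_clears_enabled(params):
    """True if clears are on, False if off, None if unknown or unparsable."""
    value = params.get("InitializeSystemMemoryAllocations")
    if value is None:
        return None
    try:
        return int(value, 0) != 0
    except ValueError:
        return None


if __name__ == "__main__":
    try:
        with open(PARAMS_PATH) as fh:
            state = memory_clears_enabled(parse_nvidia_params(fh.read()))
    except OSError as exc:
        print("cannot read", PARAMS_PATH, "-", exc)
    else:
        print("memory clears enabled:", state)
```

If this reports False, the option was probably set somewhere in the module configuration (for example in a modprobe.d file); removing it there and reloading the nvidia modules should restore the default of 1.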