Cuda or OpenCL 32bit - OK, 64bit - KO. Why? (460.39, nvidia-uvm)

ChinaphoneOne · February 11, 2021, 3:53pm

Linux version: 5.10.15-gentoo
Driver version: 460.39
GPU: GeForce GTX 750 Ti

A simple test program that calls clGetPlatformIDs or cuInit fails in most cases.

The same test compiled with gcc -m32 always works, and it looks like the 32-bit build doesn't even use UVM. If I'm lucky, I can run the 64-bit test several times in a row, but it always fails at some point and never recovers. Then I have to reload the nvidia-uvm module and try again, but the test can fail even on the first run.
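For anyone who wants to reproduce the cuInit part without a compiler, here is a minimal Python sketch of such a test. It is not the original test program (that was built with gcc); it loads the driver library through ctypes and calls cuInit(0). The helper names try_cu_init and pointer_bits are made up for this example, and the result is only meaningful for the bitness of the Python interpreter that runs it.

```python
import ctypes
import struct


def pointer_bits():
    """Return the pointer width of the running interpreter (32 or 64)."""
    return struct.calcsize("P") * 8


def try_cu_init(libname="libcuda.so.1"):
    """Call cuInit(0) from the CUDA driver library.

    Returns the CUresult code as an int (0 means CUDA_SUCCESS),
    or None if the library cannot be loaded at all.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    # CUresult cuInit(unsigned int Flags)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)
```

Calling try_cu_init() repeatedly from a 64-bit interpreter should show the same pattern as the compiled test if the setup is affected: a non-zero code once the failure starts.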

I did catch one successful launch with strace, and here is where the difference starts:

ok:

ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffdff753a60) = 0
ioctl(6, _IOC(_IOC_NONE, 0, 0x17, 0), 0x7ffdff753ae0) = 0
...

ko:

ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffec0ca2390) = 0
ioctl(6, _IOC(_IOC_NONE, 0, 0x18, 0), 0x7ffec0ca2400) = 0
munmap(0x7f590187a000, 659456)          = 0
...

where 6 is:

openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = 6

Then on success it loads /usr/lib64/libcuda.so.1 and proceeds; on failure it starts closing everything.
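The comparison of the two traces can be automated. The following is a rough Python sketch that pulls the ioctl request numbers for one file descriptor out of strace lines in the format shown above and reports the first position where the good and bad runs differ. The function names are invented for this example, and the regular expression only handles the _IOC(...) form that strace printed here.

```python
import re

# Matches the fd and the request-number field in strace lines such as
#   ioctl(6, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffdff753a60) = 0
IOCTL_RE = re.compile(
    r"ioctl\((\d+), _IOC\([^,]+, [^,]+, (0x[0-9a-fA-F]+), [^)]*\)"
)


def ioctl_numbers(strace_lines, fd):
    """Return the ioctl request numbers issued on the given fd, in order."""
    numbers = []
    for line in strace_lines:
        m = IOCTL_RE.search(line)
        if m and int(m.group(1)) == fd:
            numbers.append(int(m.group(2), 16))
    return numbers


def first_divergence(ok_lines, ko_lines, fd):
    """Return (index, ok_nr, ko_nr) for the first differing ioctl, or None."""
    ok = ioctl_numbers(ok_lines, fd)
    ko = ioctl_numbers(ko_lines, fd)
    for i, (a, b) in enumerate(zip(ok, ko)):
        if a != b:
            return i, a, b
    return None
```

Fed with the two excerpts above and fd 6, it points at the second ioctl, where the good run issues request 0x17 and the bad run issues 0x18.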

Loading the module with # modprobe nvidia_uvm uvm_debug_prints=1 uvm_enable_builtin_tests=1 uvm_debug_enable_push_desc=1 reveals the following lines in dmesg (trimmed):

uvm_channel.c:461 uvm_channel_check_errors[pid:3385] Detected a channel error, channel ID 2 (0x2) CE 0 GPU ID 1
uvm_global.c:410 uvm_global_set_fatal_error_impl[pid:3385] Encountered a global fatal error: Generic RC error [NV_ERR_RC_ERROR]
uvm_channel.c:652 init_channel[pid:3385] Channel init failed: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1
uvm_gpu.c:1130 init_gpu[pid:3385] Failed to initialize the channel manager: Generic RC error [NV_ERR_RC_ERROR], GPU ID 1

Sometimes there is also an additional line after the first error:

uvm_channel.c:468 uvm_channel_check_errors[pid:3385] Channel error likely caused by push 'Init channel' started at uvm_channel.c:641 in init_channel()

The channel ID is 2 in most cases, sometimes 3, but I never saw a number less than 2. So I suspect the successful runs were with channel ID 1 or so.
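To check which channel IDs show up across many runs, the debug lines can be tallied with a few lines of Python. This is only a sketch that assumes the exact "Detected a channel error" wording shown above; the function names are made up for the example.

```python
import re
from collections import Counter

# Matches lines such as
#   ... Detected a channel error, channel ID 2 (0x2) CE 0 GPU ID 1
CHANNEL_ERR_RE = re.compile(
    r"Detected a channel error, channel ID (\d+) \(0x[0-9a-fA-F]+\) "
    r"CE (\d+) GPU ID (\d+)"
)


def channel_errors(dmesg_lines):
    """Return a list of (channel_id, ce, gpu_id) tuples found in the lines."""
    found = []
    for line in dmesg_lines:
        m = CHANNEL_ERR_RE.search(line)
        if m:
            found.append(tuple(int(g) for g in m.groups()))
    return found


def channel_id_counts(dmesg_lines):
    """Count how often each channel ID appears in channel error lines."""
    return Counter(cid for cid, _ce, _gpu in channel_errors(dmesg_lines))
```

Running channel_id_counts over the lines of a saved dmesg dump gives a quick view of whether any failure ever involves a channel ID below 2.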

I tried hard to figure out whether this is a setup problem, changing various kernel options (making sure CONFIG_NUMA=y is set, and so on), but no luck so far. The behavior is always the same.

Apologies for my English.

generix · February 11, 2021, 4:59pm

Already tried a 5.4 kernel?

ChinaphoneOne · February 11, 2021, 6:07pm

No, but why not. So far I have tested with 5.10.{14,15}, but I guess the problem has been there for a while; I don't run CUDA/CL programs very often.
Thanks for the suggestion, I'll try a 5.4 kernel.

ChinaphoneOne · February 11, 2021, 8:44pm

Okay, puzzle solved.
Setting the nvidia module option InitializeSystemMemoryAllocations=0 breaks the nvidia-uvm module.

/*
 * Option: InitializeSystemMemoryAllocations
 *
 * Description:
 *
 * The NVIDIA Linux driver normally clears system memory it allocates
 * for use with GPUs or within the driver stack. This is to ensure
 * that potentially sensitive data is not rendered accessible by
 * arbitrary user applications.
 *
 * Owners of single-user systems or similar trusted configurations may
 * choose to disable the aforementioned clears using this option and
 * potentially improve performance.
 *
 * Possible values:
 *
 *  1 = zero out system memory allocations (default)
 *  0 = do not perform memory clears
 */

Both the 5.4 and 5.10 kernels are KO with this option set to 0, and both are OK when it is reset to the default.
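A quick way to check whether a system is running with the memory clears disabled is to look at the parameters the loaded nvidia module reports. The Python sketch below assumes that the driver exposes them in /proc/driver/nvidia/params as "Name: value" lines; both the path and the format are assumptions about the proprietary driver's procfs interface, so verify them on your driver version. The function names are invented for the example.

```python
def parse_nvidia_params(text):
    """Parse 'Name: value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            params[name.strip()] = value.strip()
    return params


def memory_clears_disabled(text):
    """True if InitializeSystemMemoryAllocations is reported as 0."""
    params = parse_nvidia_params(text)
    return params.get("InitializeSystemMemoryAllocations") == "0"


def read_params(path="/proc/driver/nvidia/params"):
    """Return the file contents, or None if the file cannot be read.

    The default path is an assumption; adjust it if your driver differs.
    """
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None
```

If read_params() returns text and memory_clears_disabled() is true for it, resetting the option to its default of 1 and reloading the modules is the first thing to try.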
