Failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

2024-05-24 16:36:50.066251: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2024-05-24 16:36:50.066269: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: e
2024-05-24 16:36:50.066274: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: e
2024-05-24 16:36:50.066322: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 555.42.2
2024-05-24 16:36:50.066336: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 555.42.2
2024-05-24 16:36:50.066340: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:248] kernel version seems to match DSO: 555.42.2

Hmm?

e@e:/usr/local$ ls
bin  cuda  cuda-12  cuda-12.5  etc  games  include  lib  man  sbin  share  src

Kinda strange; not sure how to get this working.


"Unknown error" is always great. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
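
For reference, the usual invocation is simply this (assuming the script is on root's PATH, as it normally is with the driver packages); it writes the archive into the current directory:

sudo nvidia-bug-report.sh
# produces nvidia-bug-report.log.gz in the current directory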

I had the same issue. Some others and I mentioned it in the pinned 555 release feedback & discussion thread.

I found a workaround in this thread that seems to work for me: basically just sudo deviceQuery. Running deviceQuery as a regular user gives the unknown error, but when run as root it works, and it then keeps working when run as a normal user. Other CUDA programs work properly then as well. I can only guess that this somehow triggers the driver to be loaded.
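
This is roughly what the workaround looks like on my machine (the path assumes deviceQuery has been built from the cuda-samples repository; adjust it to wherever your copy actually lives):

cd cuda-samples/Samples/1_Utilities/deviceQuery   # assumed build location of deviceQuery
sudo ./deviceQuery                                # first run as root succeeds
./deviceQuery                                     # subsequent runs as a normal user now work too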


That would rather point to the nvidia-uvm module not being loaded and nvidia-modprobe not being installed so users can’t load the module.

The nvidia-uvm module is not loaded after booting. It is also not loaded after trying to run deviceQuery as a normal user. After running deviceQuery as root, nvidia-uvm is loaded.
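
This is the kind of check I mean, if anyone wants to reproduce the observation (deviceQuery path abbreviated):

lsmod | grep nvidia_uvm    # empty right after boot
./deviceQuery              # fails as a normal user with the unknown error
lsmod | grep nvidia_uvm    # still empty
sudo ./deviceQuery         # works
lsmod | grep nvidia_uvm    # nvidia_uvm is now listed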

However, nvidia-modprobe is installed:

nvidia-modprobe/unknown,now 555.42.02-1 amd64 [installed,automatic]

Running /usr/bin/nvidia-modprobe -u does not load nvidia-uvm. However, running sudo /usr/bin/nvidia-modprobe -u does load it. This is weird since nvidia-modprobe seems to be installed as setuid root, as it should be, so I don’t understand how running it with sudo could make a difference.
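
A quick way to verify the setuid bit and ownership, in case anyone wants to check their own system (the expected mode is my assumption of how the package ships it):

ls -l /usr/bin/nvidia-modprobe                     # should show -rwsr-xr-x root root if setuid root
stat -c '%A %U:%G %n' /usr/bin/nvidia-modprobe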

Edit:
It looks like NVIDIA changed how nvidia-modprobe spawns modprobe in 555.42.02 here. I guess the setuid doesn't survive the new method somehow. Hopefully an NVIDIA dev can have a look and fix it.

I don’t seem to have nvcc installed, but trying to apt install cuda-nvcc-12-5 claims the package is already installed (so no sudo deviceQuery for me).
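
A guess, in case this is just a PATH issue (assuming the NVIDIA CUDA packages put nvcc under /usr/local/cuda-12.5, which is not on the default PATH):

which nvcc || echo "nvcc not on PATH"
ls /usr/local/cuda-12.5/bin/nvcc                 # the cuda-nvcc-12-5 package may have installed it here
export PATH="$PATH:/usr/local/cuda-12.5/bin"     # after this, nvcc --version should work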

After calling /usr/bin/nvidia-modprobe -u, the error turned back into:

2024-05-25 09:20:49.450879: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-25 09:20:49.466412: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

This is the familiar error that was once solved by setting these environment variables:

export CUDNN_PATH="$HOME/.local/lib/python3.11/site-packages/nvidia/cudnn"
export LD_LIBRARY_PATH="$CUDNN_PATH/lib":"/usr/local/cuda/lib64"
export PATH="$PATH":"/usr/local/cuda/bin"

I also tried the 12.5 variants specifically:

export LD_LIBRARY_PATH="$CUDNN_PATH/lib":"/usr/local/cuda-12.5/lib64"
export PATH="$PATH":"/usr/local/cuda-12.5/bin"
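
As a sanity check that those paths actually contain the libraries TensorFlow is trying to dlopen (the exact .so names here are assumptions and may differ by version):

ls "$CUDNN_PATH"/lib/libcudnn*.so* 2>/dev/null || echo "no cuDNN libs here"
ls /usr/local/cuda-12.5/lib64/libcudart*.so* 2>/dev/null || echo "no CUDA runtime libs here"
echo "$LD_LIBRARY_PATH"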

No luck, but I have generated an nvidia-bug-report.log:
nvidia-bug-report.log.gz (912.9 KB)

And this is the nvidia-bug-report.log after running /usr/bin/nvidia-modprobe -u.
nvidia-bug-report.log.gz (920.9 KB)

For a workaround, you could just add nvidia-uvm to the initrd and set it to load on boot.
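
Something like this should do it on Ubuntu/Debian (the paths and initramfs tooling are assumptions; adjust for your distro):

echo nvidia-uvm | sudo tee /etc/modules-load.d/nvidia-uvm.conf   # load the module at boot
echo nvidia-uvm | sudo tee -a /etc/initramfs-tools/modules       # include it in the initrd
sudo update-initramfs -u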

How would I solve the "Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU." issue, though?