Failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

2024-05-24 16:36:50.066251: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2024-05-24 16:36:50.066269: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: e
2024-05-24 16:36:50.066274: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: e
2024-05-24 16:36:50.066322: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 555.42.2
2024-05-24 16:36:50.066336: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 555.42.2
2024-05-24 16:36:50.066340: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:248] kernel version seems to match DSO: 555.42.2

Hmm?

e@e:/usr/local$ ls
bin  cuda  cuda-12  cuda-12.5  etc  games  include  lib  man  sbin  share  src

Kinda strange; not sure how to get this working.


"Unknown error" is always great. Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
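
For reference, the usual invocation is simply this (assuming the script is on root's PATH, as it normally is with the driver packages); it writes the archive into the current directory:

sudo nvidia-bug-report.sh
# produces nvidia-bug-report.log.gz in the current directory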

I had the same issue. Some others and I mentioned it in the pinned 555 release feedback & discussion thread.

I found a workaround in this thread that seems to work for me: basically just sudo deviceQuery. Running deviceQuery as a regular user gives the unknown error, but when run as root it works, and it then keeps working when run as a normal user. Other CUDA programs work properly then as well. I can only guess that this somehow triggers the driver to be loaded.
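
This is roughly what the workaround looks like on my machine (the path assumes deviceQuery has been built from the cuda-samples repository; adjust it to wherever your copy actually lives):

cd cuda-samples/Samples/1_Utilities/deviceQuery   # assumed build location of deviceQuery
sudo ./deviceQuery                                # first run as root succeeds
./deviceQuery                                     # subsequent runs as a normal user now work too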


That would rather point to the nvidia-uvm module not being loaded and nvidia-modprobe not being installed so users can’t load the module.

The nvidia-uvm module is not loaded after booting. It is also not loaded after trying to run deviceQuery as a normal user. After running deviceQuery as root, nvidia-uvm is loaded.
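
This is the kind of check I mean, if anyone wants to reproduce the observation (deviceQuery path abbreviated):

lsmod | grep nvidia_uvm    # empty right after boot
./deviceQuery              # fails as a normal user with the unknown error
lsmod | grep nvidia_uvm    # still empty
sudo ./deviceQuery         # works
lsmod | grep nvidia_uvm    # nvidia_uvm is now listed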

However, nvidia-modprobe is installed:

nvidia-modprobe/unknown,now 555.42.02-1 amd64 [installed,automatic]

Running /usr/bin/nvidia-modprobe -u does not load nvidia-uvm. However, running sudo /usr/bin/nvidia-modprobe -u does load it. This is weird since nvidia-modprobe seems to be installed as setuid root, as it should be, so I don’t understand how running it with sudo could make a difference.
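
A quick way to verify the setuid bit and ownership, in case anyone wants to check their own system (the expected mode is my assumption of how the package ships it):

ls -l /usr/bin/nvidia-modprobe                     # should show -rwsr-xr-x root root if setuid root
stat -c '%A %U:%G %n' /usr/bin/nvidia-modprobe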

Edit:
It looks like NVIDIA changed how nvidia-modprobe spawns modprobe in 555.42.02 here. I guess the setuid doesn't survive the new method somehow. Hopefully an NVIDIA dev can have a look and fix it.

I don’t seem to have nvcc installed, but trying to apt install cuda-nvcc-12-5 claims the package is already installed (so no sudo deviceQuery for me).
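
A guess, in case this is just a PATH issue (assuming the NVIDIA CUDA packages put nvcc under /usr/local/cuda-12.5, which is not on the default PATH):

which nvcc || echo "nvcc not on PATH"
ls /usr/local/cuda-12.5/bin/nvcc                 # the cuda-nvcc-12-5 package may have installed it here
export PATH="$PATH:/usr/local/cuda-12.5/bin"     # after this, nvcc --version should work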

After calling /usr/bin/nvidia-modprobe -u, the error turned back into:

2024-05-25 09:20:49.450879: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-25 09:20:49.466412: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

This is the familiar error that was once solved by setting these environment variables:

export CUDNN_PATH="$HOME/.local/lib/python3.11/site-packages/nvidia/cudnn"
export LD_LIBRARY_PATH="$CUDNN_PATH/lib":"/usr/local/cuda/lib64"
export PATH="$PATH":"/usr/local/cuda/bin"

I also tried the 12.5 variants specifically:

export LD_LIBRARY_PATH="$CUDNN_PATH/lib":"/usr/local/cuda-12.5/lib64"
export PATH="$PATH":"/usr/local/cuda-12.5/bin"
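
As a sanity check that those paths actually contain the libraries TensorFlow is trying to dlopen (the exact .so names here are assumptions and may differ by version):

ls "$CUDNN_PATH"/lib/libcudnn*.so* 2>/dev/null || echo "no cuDNN libs here"
ls /usr/local/cuda-12.5/lib64/libcudart*.so* 2>/dev/null || echo "no CUDA runtime libs here"
echo "$LD_LIBRARY_PATH"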

No luck, but I have generated an nvidia-bug-report.log:
nvidia-bug-report.log.gz (912.9 KB)

And this is the nvidia-bug-report.log after running /usr/bin/nvidia-modprobe -u.
nvidia-bug-report.log.gz (920.9 KB)

For a workaround, you could just add nvidia-uvm to the initrd and set it to load on boot.
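
Something like this should do it on Ubuntu/Debian (the paths and initramfs tooling are assumptions; adjust for your distro):

echo nvidia-uvm | sudo tee /etc/modules-load.d/nvidia-uvm.conf   # load the module at boot
echo nvidia-uvm | sudo tee -a /etc/initramfs-tools/modules       # include it in the initrd
sudo update-initramfs -u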

How would I solve the "Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU." issue, though?