Ubuntu 20.04 - CUDA 11.1.1: Missing nvidia-uvm

Hey,

I’m trying to access my RTX 2080 Ti GPU using a self-built version of TensorFlow 2.4.0-rc1, but I’m getting the following error:

2020-11-12 16:12:21.876416: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2020-11-12 16:12:21.957919: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2020-11-12 16:12:21.958051: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: hoopoe-u-u
2020-11-12 16:12:21.958082: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: hoopoe-u-u
2020-11-12 16:12:21.958297: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 455.32.0
2020-11-12 16:12:21.958396: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 455.32.0
2020-11-12 16:12:21.958423: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 455.32.0

After running ls /dev/nvidia*, the files present are (note that /dev/nvidia-uvm is missing):

/dev/nvidia0  /dev/nvidiactl  /dev/nvidia-modeset

/dev/nvidia-caps:
nvidia-cap1  nvidia-cap2
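
A minimal sketch of the same check in Python, in case it needs to be scripted. The list of expected node names is an assumption for a single-GPU machine (it is not taken from the NVIDIA docs), and `missing_nodes` is a hypothetical helper name:

```python
import os

# Device nodes a working single-GPU CUDA setup is assumed to expose.
# This list is an illustrative assumption; adjust it for your machine.
EXPECTED_NODES = ("nvidia0", "nvidiactl", "nvidia-uvm")


def missing_nodes(dev_dir="/dev", expected=EXPECTED_NODES):
    """Return the expected node names that do not exist under dev_dir."""
    return [name for name in expected
            if not os.path.exists(os.path.join(dev_dir, name))]


if __name__ == "__main__":
    missing = missing_nodes()
    if missing:
        print("missing device nodes:", ", ".join(missing))
    else:
        print("all expected device nodes are present")
```

With the listing above, this would report nvidia-uvm as the only missing node.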

After running the Device Node Verification script for enabling/fixing nvidia_uvm, I get:

mknod: /dev/nvidia0: File exists
mknod: /dev/nvidiactl: File exists
modprobe: ERROR: could not insert 'nvidia_uvm': Unknown symbol in module, or unknown parameter (see dmesg)

Checking dmesg afterwards, the last lines of output are:

[ 4025.104864] audit: type=1400 audit(1605194221.534:3846): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_nss" name="/proc/16980/cmdline" pid=811 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 4025.105569] audit: type=1400 audit(1605194221.534:3847): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_nss" name="/proc/16981/cmdline" pid=811 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 4037.833657] audit: type=1400 audit(1605194234.263:3848): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_nss" name="/proc/16987/cmdline" pid=811 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=1168429606
[ 4043.073973] audit: type=1400 audit(1605194239.503:3849): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_nss" name="/proc/17003/cmdline" pid=811 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 4043.085345] audit: type=1400 audit(1605194239.515:3850): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_pam" name="/proc/17003/cmdline" pid=812 comm="sssd_pam" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 4043.136344] audit: type=1400 audit(1605194239.567:3851): apparmor="ALLOWED" operation="mknod" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be" name="/var/lib/sss/pubconf/.krb5info_dummy_HvXPyr" pid=810 comm="sssd_be" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
[ 4043.136346] audit: type=1400 audit(1605194239.567:3852): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be" name="/var/lib/sss/pubconf/.krb5info_dummy_HvXPyr" pid=810 comm="sssd_be" requested_mask="wrc" denied_mask="wrc" fsuid=0 ouid=0
[ 4043.136347] audit: type=1400 audit(1605194239.567:3853): apparmor="ALLOWED" operation="chmod" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be" name="/var/lib/sss/pubconf/.krb5info_dummy_HvXPyr" pid=810 comm="sssd_be" requested_mask="w" denied_mask="w" fsuid=0 ouid=0
[ 4043.136348] audit: type=1400 audit(1605194239.567:3854): apparmor="ALLOWED" operation="rename_src" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be" name="/var/lib/sss/pubconf/.krb5info_dummy_HvXPyr" pid=810 comm="sssd_be" requested_mask="wrd" denied_mask="wrd" fsuid=0 ouid=0
[ 4043.136349] audit: type=1400 audit(1605194239.567:3855): apparmor="ALLOWED" operation="rename_dest" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be" name="/var/lib/sss/pubconf/kdcinfo.AD.IGD.FRAUNHOFER.DE" pid=810 comm="sssd_be" requested_mask="wc" denied_mask="wc" fsuid=0 ouid=0
[ 4043.137317] audit: type=1400 audit(1605194239.567:3856): apparmor="ALLOWED" operation="exec" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be" name="/usr/libexec/sssd/ldap_child" pid=17008 comm="sssd_be" requested_mask="x" denied_mask="x" fsuid=0 ouid=0 target="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be//null-/usr/libexec/sssd/ldap_child"
[ 4043.137892] audit: type=1400 audit(1605194239.567:3857): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be//null-/usr/libexec/sssd/ldap_child" name="/usr/libexec/sssd/ldap_child" pid=17008 comm="ldap_child" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 4043.137894] audit: type=1400 audit(1605194239.567:3858): apparmor="ALLOWED" operation="file_mmap" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be//null-/usr/libexec/sssd/ldap_child" name="/usr/lib/x86_64-linux-gnu/ld-2.31.so" pid=17008 comm="ldap_child" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 4043.442952] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 4043.443113] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)
[ 4043.443142] nvidia_uvm: Unknown symbol set_cpus_allowed_ptr (err -2)
[ 4043.443175] nvidia_uvm: Unknown symbol mmu_notifier_unregister (err -2)
[ 4043.443253] nvidia_uvm: Unknown symbol __mmu_notifier_register (err -2)
[ 4087.312635] kauditd_printk_skb: 86 callbacks suppressed
[ 4087.312636] audit: type=1400 audit(1605194283.743:3945): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_nss" name="/proc/17046/cmdline" pid=811 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 4087.314778] audit: type=1400 audit(1605194283.743:3946): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_pam" name="/proc/17046/cmdline" pid=812 comm="sssd_pam" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 4087.344651] audit: type=1400 audit(1605194283.775:3947): apparmor="ALLOWED" operation="capable" profile="/usr/sbin/sssd//null-/usr/libexec/sssd/sssd_be" pid=810 comm="sssd_be" capability=2  capname="dac_read_search"
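
Since the relevant lines are buried in audit noise, here is a rough sketch of how the unresolved symbols could be pulled out of saved dmesg text. The helper name `unknown_symbols` and the regular expression are illustrative assumptions that match the line format shown above, not an official log format:

```python
import re

# Matches kernel log lines such as:
#   [ 4043.443113] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)
UNKNOWN_SYMBOL = re.compile(r"\b(\w+): Unknown symbol (\S+) \(err (-?\d+)\)")


def unknown_symbols(dmesg_text, module="nvidia_uvm"):
    """Return the symbols that `module` failed to resolve, in log order."""
    symbols = []
    for line in dmesg_text.splitlines():
        match = UNKNOWN_SYMBOL.search(line)
        if match and match.group(1) == module:
            symbols.append(match.group(2))
    return symbols


if __name__ == "__main__":
    # Two lines copied from the dmesg output above, used as sample input.
    sample = (
        "[ 4043.443113] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)\n"
        "[ 4043.443142] nvidia_uvm: Unknown symbol set_cpus_allowed_ptr (err -2)\n"
    )
    for symbol in unknown_symbols(sample):
        print(symbol)
```

Unknown symbols like these usually mean the module was built against a kernel interface that the running kernel no longer exports, which fits the kernel-version explanation given later in the thread.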

I’m trying to enable TensorFlow GPU access, and fixing nvidia_uvm might solve the problem. Could you please help me with this? The TensorFlow Python wheel I built was tested in an nvidia-docker container with CUDA 11.1.1 and Ubuntu 20.04 and worked fine. However, it cannot access the GPU on my host machine outside the container …

SOLVED. The problem was caused by updating my Linux kernel to the latest version (5.9.8). Rolling back to v5.4 solved it, since nvidia_uvm, which CUDA needs, is not supported on the latest unsigned Linux kernel.

Please, I urgently need your help, since I’m new to NVIDIA/CUDA.
I installed the cuda-11.1 toolkit on Friday, two days ago, but I have never managed to pass the tests: none of the sample programs runs after compilation.
I am using an office laptop, an HP ProBook with a GeForce MX130.

  • My default kernel was 5.4.0.54…60, but in order to install cuda-11.1 I updated my Linux kernel to the latest one, 5.9.12. The first problem I encountered was some missing firmware in rtl_nic, which I solved by downloading the firmware from git.
    Unfortunately, after fixing that issue my tests were still unsuccessful.
  • Each time I run ./nbody
    I get the error message: “Error: only 0 Devices available, 1 requested. Exiting.” regardless of the options or parameters I append.

  • If I run ./deviceQuery
    I get the message:
    cudaGetDeviceCount returned 999

  • After reading this post, I reverted the Linux kernel to the previous version, 5.4.0.54.60. Now when I run
    $ ./deviceQuery
    the output is:
    ./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
→ no CUDA-capable device is detected
Result = FAIL
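
For anyone reading along, the two return codes seen so far can be decoded with a small lookup table. The numeric values and names come from the CUDA runtime's cudaError_t enum. The "seen in this thread" notes are only observations from the posts here, not general diagnoses, and `describe` is a hypothetical helper name:

```python
# Values from the CUDA runtime's cudaError_t enum.
# The third field records what each code coincided with in this thread only.
CUDA_ERRORS = {
    100: ("cudaErrorNoDevice", "no CUDA-capable device is detected",
          "seen on kernel 5.4 when the nvidia module could not be found"),
    999: ("cudaErrorUnknown", "unknown error",
          "seen on kernel 5.9.12"),
}


def describe(code):
    """Return a one-line description for a cudaGetDeviceCount return code."""
    if code not in CUDA_ERRORS:
        return f"code {code}: not in this table"
    name, message, note = CUDA_ERRORS[code]
    return f"{name} ({code}): {message} [{note}]"


if __name__ == "__main__":
    for code in (999, 100):
        print(describe(code))
```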

Running the startup script given at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions, section 7.4 (Device Node Verification), I get the following output:
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.4.0-54-generic
In addition, running
$ nvidia-smi -L
returns:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
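
The modprobe error above says no nvidia module exists for the running kernel. A rough way to confirm that is to search the module tree directly, as sketched below. This only approximates what modprobe does (modprobe consults modules.dep and aliases rather than walking the tree), and `module_files` is a hypothetical helper name:

```python
import os
import platform


def module_files(module, release=None, modules_root="/lib/modules"):
    """List files under modules_root/release that provide kernel module `module`."""
    release = release or platform.release()
    # Hyphens and underscores are interchangeable in module names.
    wanted = module.replace("-", "_")
    found = []
    for dirpath, _dirs, filenames in os.walk(os.path.join(modules_root, release)):
        for name in filenames:
            # Module files end in .ko, optionally compressed (.ko.xz, .ko.zst, ...).
            stem, sep, _ext = name.partition(".ko")
            if sep and stem.replace("-", "_") == wanted:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    for module in ("nvidia", "nvidia_uvm"):
        paths = module_files(module)
        print(module, "->", paths if paths else "not found for this kernel")
```

An empty result for the running kernel suggests the driver module was never built or installed for that kernel version, rather than the kernel being unable to support it.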

Thus, on my system, Linux kernel 5.4.0.54.60 does not support nvidia. My only hope is v5.9.12.
Is there anything I can do to resolve this problem? Your help will be highly appreciated.

Regards

PS: At the same time, I was reading this post:
https://www.archlinux.org/news/nvidia-45528-is-incompatible-with-linux-59/

nvidia 455.28 is incompatible with linux >= 5.9


2020-10-21 - Sven-Hendrik Haase
nvidia is currently partially incompatible with linux >= 5.9 …
While graphics should work fine, CUDA, OpenCL, and likely other features are broken. Users who’ve already upgraded and need those features are advised to switch to the linux-lts kernel for the time being until a fix for nvidia is available.
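
A small sketch of how to check whether the running kernel falls in the range the news item describes. The 5.9 threshold is taken from the quoted headline and applies only to the 455.x driver discussed here; the helper names are made up for illustration:

```python
import platform
import re

# From the quoted news item: nvidia 455.28 is incompatible with linux >= 5.9.
# This threshold is specific to that driver version, not a general rule.
FIRST_AFFECTED = (5, 9)


def kernel_major_minor(release):
    """Extract (major, minor) from a release string like '5.4.0-54-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(release, first_affected=FIRST_AFFECTED):
    """True if `release` is at or above the first affected kernel version."""
    return kernel_major_minor(release) >= first_affected


if __name__ == "__main__":
    release = platform.release()
    try:
        state = "affected" if is_affected(release) else "not affected"
    except ValueError:
        state = "could not be parsed"
    print(f"kernel {release}: {state}")
```

The kernel versions mentioned in this thread are consistent with that threshold: 5.9.8 and 5.9.12 fall in the affected range, 5.4 and 5.8.18 do not.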

It may be possible to move my kernel to 5.8 instead. I will let you know if I succeed in resolving my problem. Cheers.

[Solved]
As noted at the end of my post, the problem has been solved by using Linux kernel 5.8.18.
Everything works smoothly. :)