CUDA install fails on Amazon Linux: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver."

Hi, I am trying to install CUDA on an AWS EC2 g4dn.xlarge instance, following the CUDA Installation Guide for Linux. I did this about a month ago and it worked fine, but now it has started failing :/ I’ve already spent some time investigating, but I cannot resolve the problem. Any help would be greatly appreciated!

I’m starting from a fresh basic Amazon Linux image (ami-0669b163befffbdfc). I do the pre-installation actions:

[ec2-user@i-03018febbaff59efb ~]$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

[ec2-user@i-03018febbaff59efb ~]$ uname -m && cat /etc/*release
x86_64
Amazon Linux release 2023 (Amazon Linux)
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-15"
Amazon Linux release 2023 (Amazon Linux)

The image doesn’t come with gcc, so I install it and then check the version:

[ec2-user@i-03018febbaff59efb ~]$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Then I follow the instructions for Fedora (Section 3.6.1).

I have

[ec2-user@i-03018febbaff59efb ~]$ uname -r
6.1.61-85.141.amzn2023.x86_64

and I install the kernel-devel and kernel-headers packages for that kernel.
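
For anyone reproducing this step, here is a minimal sketch of how the package names can be derived from the running kernel. The helper name kernel_build_packages is hypothetical, and it assumes the packages are published as kernel-devel-<release> and kernel-headers-<release>, which may not hold for every kernel build.

```shell
# Print the build-time kernel packages matching a kernel release string.
# Defaults to the running kernel; pass a release explicitly to preview.
# Assumption: the repo names packages kernel-devel-<release> and
# kernel-headers-<release>.
kernel_build_packages() {
  release="${1:-$(uname -r)}"
  printf '%s\n' "kernel-devel-${release}" "kernel-headers-${release}"
}

# Usage (not run here because it needs root and network access):
#   sudo dnf install -y $(kernel_build_packages)
```

Pinning the packages to the output of uname -r keeps the headers in step with the kernel that is actually booted, which is what the driver build needs.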

Removing the outdated keys fails:

[ec2-user@i-03018febbaff59efb ~]$ sudo rpm --erase gpg-pubkey-7fa2af80*
error: package gpg-pubkey-7fa2af80* is not installed

Then I follow the network repo installation for fedora37. Installing the nvidia-drivers and cuda-toolkit seems fine (logs attached). Then I follow the final instructions (Section 3.6.4):

[ec2-user@i-03018febbaff59efb lib64]$ ls /usr/lib64/libcuda* -l
lrwxrwxrwx. 1 root root 20 Nov 7 05:22 /usr/lib64/libcuda.so -> libcuda.so.545.23.08
lrwxrwxrwx. 1 root root 20 Nov 7 05:22 /usr/lib64/libcuda.so.1 -> libcuda.so.545.23.08
-rwxr-xr-x. 1 root root 29453200 Nov 7 00:49 /usr/lib64/libcuda.so.545.23.08
lrwxrwxrwx. 1 root root 28 Nov 7 05:22 /usr/lib64/libcudadebugger.so.1 -> libcudadebugger.so.545.23.08
-rwxr-xr-x. 1 root root 10593576 Nov 7 00:14 /usr/lib64/libcudadebugger.so.545.23.08

And then I do the post-installation actions to get:

[ec2-user@i-03018febbaff59efb local]$ echo $PATH
/usr/local/cuda-12.3/bin:/home/ec2-user/.local/bin:/home/ec2-user/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin

[ec2-user@i-03018febbaff59efb local]$ echo $LD_LIBRARY_PATH
/usr/local/cuda-12.3/lib64
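
The post-installation environment shown above can be set up roughly like this. It is only a sketch: the variable name CUDA_HOME is my own choice for illustration, and the prefix is taken from the PATH output above.

```shell
# Assumed install prefix, taken from the PATH shown above.
# CUDA_HOME is an illustrative variable name, not something the
# installer sets.
CUDA_HOME=/usr/local/cuda-12.3

# Prepend the CUDA directories; the ${VAR:+:${VAR}} form avoids a
# trailing colon when the variable was previously empty or unset.
export PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

These variables only affect where the toolkit binaries and libraries are found; they have no influence on whether the kernel driver loads.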

After all of this, nvidia-smi fails:

[ec2-user@i-03018febbaff59efb ~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This is what I find in dmesg:

[ec2-user@i-03018febbaff59efb local]$ dmesg
...
[    2.690763] nvidia: loading out-of-tree module taints kernel.
[    2.691647] nvidia: module license 'NVIDIA' taints kernel.
[    2.692444] Disabling lock debugging due to kernel taint
[    2.742885] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    2.744378] nvidia: Unknown symbol drm_gem_object_free (err -2)
[    2.838904] zram_generator::config[1757]: zram0: system has too much memory (15779MB), limit is 800MB, ignoring.
[    2.839269] systemd-sysv-generator[1755]: SysV service '/etc/rc.d/init.d/cfn-hup' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[    2.845374] nvidia: Unknown symbol drm_gem_object_free (err -2)
[    3.229924] RPC: Registered named UNIX socket transport module.
[    3.230597] RPC: Registered udp transport module.
[    3.231125] RPC: Registered tcp transport module.
[    3.231645] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    3.353325] ena 0000:00:05.0 ens5: Local page cache is disabled for less than 16 channels
[   21.054526] nvidia: Unknown symbol drm_gem_object_free (err -2)
[   21.157303] nvidia: Unknown symbol drm_gem_object_free (err -2)
[   21.286533] nvidia: Unknown symbol drm_gem_object_free (err -2)
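
Since the relevant lines are buried in a lot of unrelated output, a small filter like the following can pull out just the unresolved symbols. This is a sketch; the function name unknown_symbols is hypothetical, and it assumes the kernel log uses the "Unknown symbol NAME (err N)" wording seen above.

```shell
# Read kernel log text on stdin and print each unresolved symbol once.
# Assumes the "Unknown symbol <name> (err <n>)" message format.
unknown_symbols() {
  sed -n 's/.*Unknown symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Usage on a live system (dmesg may need root):
#   sudo dmesg | unknown_symbols
```

On the log above this reduces the repeated messages to the single symbol drm_gem_object_free, which belongs to the kernel's drm subsystem.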

lsmod output:

[ec2-user@i-03018febbaff59efb ~]$ lsmod
Module                  Size  Used by
nls_ascii              16384  1
sunrpc                692224  1
nls_cp437              20480  1
vfat                   24576  1
fat                    86016  1 vfat
ghash_clmulni_intel    16384  0
aesni_intel           393216  0
wmi                    36864  0
crypto_simd            16384  1 aesni_intel
i8042                  45056  0
cryptd                 28672  2 crypto_simd,ghash_clmulni_intel
i2c_core              106496  0
serio                  28672  3 i8042
ena                   163840  0
button                 24576  0
sch_fq_codel           20480  5
dm_mod                188416  0
fuse                  163840  1
loop                   32768  0
configfs               57344  1
dax                    45056  1 dm_mod
dmi_sysfs              20480  0
crc32_pclmul           16384  0
crc32c_intel           24576  0
efivarfs               24576  1
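
Neither nvidia nor drm appears in the list above. A check like the following makes that explicit without reading the list by eye. It is a sketch with a hypothetical helper name (module_listed); note that code built directly into the kernel never shows up in lsmod, so a missing entry is a hint rather than proof.

```shell
# Succeed if the module named in $1 appears in lsmod-style text on stdin.
# The first line is the "Module Size Used by" header, so it is skipped.
module_listed() {
  awk -v mod="$1" 'NR > 1 && $1 == mod { found = 1 } END { exit !found }'
}

# Usage on a live system:
#   for m in nvidia drm; do
#     lsmod | module_listed "$m" && echo "$m loaded" || echo "$m missing"
#   done
```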

If I try to start the nvidia persistence daemon I get this in the system log:

Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Shutdown (2378)
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permis>
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal kernel: nvidia: Unknown symbol drm_gem_object_free (err -2)
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Started (2378)

cuda-toolkit-install.log (33.9 KB)
nvidia-driver-install.log (56.7 KB)

Any help will be very much appreciated :)

The missing-symbol error suggests that the nvidia-drm, and maybe also the nvidia-modeset, modules are missing. Maybe they’re just not embedded into the initrd? Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Great, thanks for your response :) here is the bug report
nvidia-bug-report.log.gz (39.5 KB)

The modules compiled fine. Please check whether they’re installed correctly:
modinfo nvidia-drm
If so, please run
sudo depmod -a
Then try to load the nvidia module manually:
sudo rmmod nvidia
sudo modprobe nvidia
and check dmesg to see whether nvidia and nvidia-drm are loaded.

Sorry, I think you can forget about fiddling with the nvidia modules; I just noticed that the whole drm subsystem module is missing from the kernel. Since this seems to be an Amazon-specific kernel, I don’t know what the package is called. Please try installing “linux-modules-$(uname -r)”, then check whether the drm.ko module is available.
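
One way to check for drm.ko is to search the module tree of the running kernel. The sketch below uses a hypothetical helper name (has_drm_module) and assumes the modules live under /lib/modules/$(uname -r), possibly compressed (for example drm.ko.xz).

```shell
# Succeed if a drm.ko file (possibly compressed, e.g. drm.ko.xz) exists
# anywhere under the given modules directory.
# Assumption: modules are installed under /lib/modules/<release>.
has_drm_module() {
  moddir="${1:-/lib/modules/$(uname -r)}"
  [ -n "$(find "$moddir" -name 'drm.ko*' 2>/dev/null | head -n 1)" ]
}

# Usage on a live system:
#   has_drm_module && echo "drm.ko found" || echo "drm.ko missing"
```

Running modinfo drm is a quicker alternative if only a yes/no answer is needed.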

Alright, this was super helpful! Indeed

sudo dnf install kernel-modules-extra.x86_64

solved the problem, thanks a lot :)
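
For reference, the fix plus a quick verification could be scripted roughly as follows. This is a sketch: the helper names run and apply_fix and the DRY_RUN variable are made up for illustration, and it assumes that loading the module by hand is enough (a reboot may be the simpler route).

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Install the extra kernel modules (which, per this thread, is what
# resolves the missing drm symbol), then load the driver and query it.
apply_fix() {
  run sudo dnf install -y kernel-modules-extra
  run sudo modprobe nvidia
  run nvidia-smi
}

# Usage:
#   DRY_RUN=1 apply_fix   # preview the commands
#   apply_fix             # execute them
```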


Yes, this worked for me as well.