Hi, I am trying to install CUDA on an AWS EC2 g4dn.xlarge instance. I’m following the instructions from here: CUDA Installation Guide for Linux. I did this about a month ago and it worked fine, but now it has started failing :/ I’ve already spent some time investigating this, but I cannot resolve the problem. Any help would be greatly appreciated!
I’m starting from a fresh basic Amazon Linux image (ami-0669b163befffbdfc). I do the pre-installation actions:
[ec2-user@i-03018febbaff59efb ~]$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
[ec2-user@i-03018febbaff59efb ~]$ uname -m && cat /etc/*release
x86_64
Amazon Linux release 2023 (Amazon Linux)
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-15"
Amazon Linux release 2023 (Amazon Linux)
The image doesn’t come with gcc, so I install it and then check the version:
[ec2-user@i-03018febbaff59efb ~]$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Then I follow the instructions for Fedora (Section 3.6.1).
The running kernel is:
[ec2-user@i-03018febbaff59efb ~]$ uname -r
6.1.61-85.141.amzn2023.x86_64
and I install the matching kernel-devel and kernel-headers packages.
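For reference, a minimal sketch of that step, assuming the packages are named after the running kernel release (kernel-devel-<release> and kernel-headers-<release>). The helper name kernel_build_packages is made up for illustration, and the exact command used in this session was not recorded in the post.

```shell
# Hypothetical helper: print the kernel-devel / kernel-headers package names
# for a given kernel release (defaults to the running kernel).
kernel_build_packages() {
    release="${1:-$(uname -r)}"
    printf 'kernel-devel-%s kernel-headers-%s\n' "$release" "$release"
}

# Typical use (needs root and network access, so it is left commented out):
#   sudo dnf install $(kernel_build_packages)
```

Pinning the release in the package name helps avoid pulling headers for a newer kernel than the one that is actually booted.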
Removing the outdated keys fails:
[ec2-user@i-03018febbaff59efb ~]$ sudo rpm --erase gpg-pubkey-7fa2af80*
error: package gpg-pubkey-7fa2af80* is not installed
Then I follow the network repo installation for fedora37. Installing the NVIDIA driver and cuda-toolkit seems fine (logs attached). Then I do the final instructions (Section 3.6.4):
[ec2-user@i-03018febbaff59efb lib64]$ ls /usr/lib64/libcuda* -l
lrwxrwxrwx. 1 root root 20 Nov 7 05:22 /usr/lib64/libcuda.so -> libcuda.so.545.23.08
lrwxrwxrwx. 1 root root 20 Nov 7 05:22 /usr/lib64/libcuda.so.1 -> libcuda.so.545.23.08
-rwxr-xr-x. 1 root root 29453200 Nov 7 00:49 /usr/lib64/libcuda.so.545.23.08
lrwxrwxrwx. 1 root root 28 Nov 7 05:22 /usr/lib64/libcudadebugger.so.1 -> libcudadebugger.so.545.23.08
-rwxr-xr-x. 1 root root 10593576 Nov 7 00:14 /usr/lib64/libcudadebugger.so.545.23.08
Then I do the post-installation actions, which give:
[ec2-user@i-03018febbaff59efb local]$ echo $PATH
/usr/local/cuda-12.3/bin:/home/ec2-user/.local/bin:/home/ec2-user/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
[ec2-user@i-03018febbaff59efb local]$ echo $LD_LIBRARY_PATH
/usr/local/cuda-12.3/lib64
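The values above are consistent with prepending the CUDA 12.3 directories to both variables. A sketch of one way to do that without leaving a stray colon when a variable starts out empty; prepend_dir is a hypothetical helper name, and the two paths are the ones shown in the output above.

```shell
# Hypothetical helper: prepend a directory to a colon-separated list.
# $1 = directory to add, $2 = current value (may be empty).
prepend_dir() {
    printf '%s%s\n' "$1" "${2:+:$2}"
}

# Prepend the CUDA 12.3 directories from the output above.
export PATH="$(prepend_dir /usr/local/cuda-12.3/bin "$PATH")"
export LD_LIBRARY_PATH="$(prepend_dir /usr/local/cuda-12.3/lib64 "${LD_LIBRARY_PATH:-}")"
```

These variables only affect the user-space toolkit; they should not influence whether the kernel module loads.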
After all of this, nvidia-smi fails:
[ec2-user@i-03018febbaff59efb ~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
This is what I find in dmesg:
[ec2-user@i-03018febbaff59efb local]$ dmesg
...
[ 2.690763] nvidia: loading out-of-tree module taints kernel.
[ 2.691647] nvidia: module license 'NVIDIA' taints kernel.
[ 2.692444] Disabling lock debugging due to kernel taint
[ 2.742885] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.744378] nvidia: Unknown symbol drm_gem_object_free (err -2)
[ 2.838904] zram_generator::config[1757]: zram0: system has too much memory (15779MB), limit is 800MB, ignoring.
[ 2.839269] systemd-sysv-generator[1755]: SysV service '/etc/rc.d/init.d/cfn-hup' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[ 2.845374] nvidia: Unknown symbol drm_gem_object_free (err -2)
[ 3.229924] RPC: Registered named UNIX socket transport module.
[ 3.230597] RPC: Registered udp transport module.
[ 3.231125] RPC: Registered tcp transport module.
[ 3.231645] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 3.353325] ena 0000:00:05.0 ens5: Local page cache is disabled for less than 16 channels
[ 21.054526] nvidia: Unknown symbol drm_gem_object_free (err -2)
[ 21.157303] nvidia: Unknown symbol drm_gem_object_free (err -2)
[ 21.286533] nvidia: Unknown symbol drm_gem_object_free (err -2)
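The same "Unknown symbol" line repeats on every load attempt, so it can help to reduce the log to the distinct symbols the module failed to resolve. A small sketch, assuming the kernel log lines have the form shown above; unknown_symbols is a made-up helper name.

```shell
# Hypothetical helper: read kernel log text on stdin and print each distinct
# unresolved symbol for a module, followed by how many times it was reported.
# $1 = module name (defaults to "nvidia").
unknown_symbols() {
    awk -v mod="${1:-nvidia}" '
        index($0, mod ": Unknown symbol ") {
            # The symbol name is the word right after "Unknown symbol".
            for (i = 1; i < NF; i++)
                if ($i == "Unknown" && $(i + 1) == "symbol") count[$(i + 2)]++
        }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Typical use (reading dmesg may need root on some systems):
#   dmesg | unknown_symbols nvidia
```

In the excerpt above this would report a single symbol, drm_gem_object_free, which suggests the nvidia module is being loaded but cannot link against something it expects from the kernel's DRM code.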
lsmod output:
[ec2-user@i-03018febbaff59efb ~]$ lsmod
Module Size Used by
nls_ascii 16384 1
sunrpc 692224 1
nls_cp437 20480 1
vfat 24576 1
fat 86016 1 vfat
ghash_clmulni_intel 16384 0
aesni_intel 393216 0
wmi 36864 0
crypto_simd 16384 1 aesni_intel
i8042 45056 0
cryptd 28672 2 crypto_simd,ghash_clmulni_intel
i2c_core 106496 0
serio 28672 3 i8042
ena 163840 0
button 24576 0
sch_fq_codel 20480 5
dm_mod 188416 0
fuse 163840 1
loop 32768 0
configfs 57344 1
dax 45056 1 dm_mod
dmi_sysfs 20480 0
crc32_pclmul 16384 0
crc32c_intel 24576 0
efivarfs 24576 1
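Neither nvidia nor any drm module appears in this listing. A sketch of a quick check for a set of module names against lsmod-style output; report_modules is a hypothetical helper, and the module names in the usage line are only examples of what one might look for.

```shell
# Hypothetical helper: read lsmod-format text on stdin and report, for each
# module name given as an argument, whether it appears in the listing.
report_modules() {
    listing="$(cat)"
    for m in "$@"; do
        # Skip the header row (NR > 1) and compare the first column exactly.
        if printf '%s\n' "$listing" |
            awk -v m="$m" 'NR > 1 && $1 == m { f = 1 } END { exit !f }'; then
            echo "$m loaded"
        else
            echo "$m missing"
        fi
    done
}

# Typical use:
#   lsmod | report_modules nvidia drm drm_kms_helper
```

If drm turns out to be missing, running modinfo drm would show whether a drm module is installed for the running kernel at all, which may be relevant given that the unresolved symbol has a drm_ prefix.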
If I try to start the NVIDIA persistence daemon, I get this in the system log:
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Shutdown (2378)
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permis>
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal kernel: nvidia: Unknown symbol drm_gem_object_free (err -2)
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Started (2378)
cuda-toolkit-install.log (33.9 KB)
nvidia-driver-install.log (56.7 KB)
Any help will be very much appreciated :)