AWS Linux 2 - can't install Nvidia Drivers

I am following the NVIDIA Driver Installation Quickstart Guide (NVIDIA Tesla Documentation), which the AWS page "Install NVIDIA drivers on Linux instances" (Amazon Elastic Compute Cloud) recommends under "Option 2: Public NVIDIA drivers".

Every step in the NVIDIA instructions, up to and including sudo yum install -y nvidia-driver-latest-dkms, completes successfully, but running nvidia-smi gives the following error.


$ nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Running sudo dkms install nvidia/545.23.06 generates:


Kernel preparation unnecessary for this kernel. Skipping...

Building module:

cleaning build area...

'make' -j2 module SYSSRC=/lib/modules/5.10.197-186.748.amzn2.x86_64/build IGNORE_XEN_PRESENCE=1 IGNORE_PREEMPT_RT_PRESENCE=1 IGNORE_CC_MISMATCH=1...........(bad exit status: 2)

Error! Bad return status for module build on kernel: 5.10.197-186.748.amzn2.x86_64 (x86_64)

Consult /var/lib/dkms/nvidia/545.23.06/build/make.log for more information.

The log file mentioned above is about 12k lines long, but the bottom ~200 lines show:

./include/asm-generic/bug.h:94:19: note: in expansion of macro ‘__WARN_FLAGS’

#define __WARN() __WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))

^~~~~~~~~~~~

./include/asm-generic/bug.h:121:3: note: in expansion of macro ‘__WARN’

__WARN(); \

^~~~~~

/var/lib/dkms/nvidia/545.23.06/build/nvidia/nv.c:4923:5: note: in expansion of macro ‘WARN_ON’

WARN_ON(rm_set_external_kernel_client_count(sp, NV_STATE_PTR(nvl), NV_FALSE) != NV_OK);

^~~~~~~

In file included from <command-line>:0:0:

././include/linux/compiler_types.h:245:24: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]

#define asm_inline asm __inline

^

./arch/x86/include/asm/bug.h:36:2: note: in expansion of macro ‘asm_inline’

asm_inline volatile("1:\t" ins "\n" \

^~~~~~~~~~

./arch/x86/include/asm/bug.h:88:2: note: in expansion of macro ‘_BUG_FLAGS’

_BUG_FLAGS(ASM_UD2, BUGFLAG_WARNING|(flags)); \

^~~~~~~~~~

./include/asm-generic/bug.h:94:19: note: in expansion of macro ‘__WARN_FLAGS’

#define __WARN() __WARN_FLAGS(BUGFLAG_TAINT(TAINT_WARN))

^~~~~~~~~~~~

./include/asm-generic/bug.h:121:3: note: in expansion of macro ‘__WARN’

__WARN(); \

^~~~~~

/var/lib/dkms/nvidia/545.23.06/build/nvidia/nv.c:4923:5: note: in expansion of macro ‘WARN_ON’

WARN_ON(rm_set_external_kernel_client_count(sp, NV_STATE_PTR(nvl), NV_FALSE) != NV_OK);

^~~~~~~

/var/lib/dkms/nvidia/545.23.06/build/nvidia/nv.c: In function ‘nv_s2idle_pm_configured’:

/var/lib/dkms/nvidia/545.23.06/build/nvidia/nv.c:5531:23: error: ‘IOPRIO_DEFAULT’ undeclared (first use in this function); did you mean ‘LMI_DEFAULT’?

kiocb.ki_ioprio = IOPRIO_DEFAULT;

^~~~~~~~~~~~~~

LMI_DEFAULT

/var/lib/dkms/nvidia/545.23.06/build/nvidia/nv.c:5531:23: note: each undeclared identifier is reported only once for each function it appears in

cc1: some warnings being treated as errors

make[5]: *** [/var/lib/dkms/nvidia/545.23.06/build/nvidia/nv.o] Error 1

make[4]: *** [/var/lib/dkms/nvidia/545.23.06/build] Error 2

make[3]: *** [modules] Error 2

make[2]: *** [__sub-make] Error 2

make[2]: Leaving directory `/usr/src/kernels/5.10.197-186.748.amzn2.x86_64'

make[1]: *** [modules] Error 2

make[1]: Leaving directory `/usr/src/kernels/5.10.197-186.748.amzn2.x86_64'

make: *** [modules] Error 2
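
In case it helps anyone reading along, here is a small sketch for pulling only the hard compiler errors out of that log instead of scrolling through all 12k lines. first_errors is a helper name I made up, and the log path is the one dkms printed above; adjust both as needed.

```shell
#!/bin/sh
# first_errors is a hypothetical helper, not part of dkms or the NVIDIA driver.
# It prints lines containing "error:" (skipping warnings and notes),
# prefixed with their line number in the log, up to a limit (default 20).
first_errors() {
    grep -n 'error:' "$1" | head -n "${2:-20}"
}

# Path taken from the dkms message; only read it if it exists on this machine.
log=/var/lib/dkms/nvidia/545.23.06/build/make.log
if [ -f "$log" ]; then
    first_errors "$log"
fi
```

On the log above this should narrow things down to the IOPRIO_DEFAULT line in nv.c, which looks like the only real error; everything else in the tail is warnings and notes.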

Any suggestions on how to fix this?

Hello @user8250 and welcome to the NVIDIA developer forums.

The initial error message is a clear sign of a failed driver installation. To find the reason, it helps to run nvidia-bug-report.sh, check its output, and attach it here so we can look at it.

Secondly, I do not think we have certified r545-based drivers for datacenter use yet; those should still be at r535.
What GPU hardware does your instance use?

Lastly, errors of this kind point to either the wrong kernel version being used or a mismatch between the running kernel version and the kernel headers needed to compile the kernel module.
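
A rough way to check for that mismatch is to confirm that a build tree exists for the kernel that is actually running. The sketch below assumes the usual /lib/modules/<release>/build layout (the same SYSSRC path that appears in your dkms output); check_kernel_headers is a hypothetical helper name, not something shipped with the driver.

```shell
#!/bin/sh
# check_kernel_headers is a hypothetical helper, not part of any NVIDIA or AWS tool.
# $1: kernel release string, e.g. the output of `uname -r`
# $2: modules root (optional, defaults to /lib/modules)
# Succeeds if <modules root>/<release>/build resolves to a directory.
check_kernel_headers() {
    release="$1"
    build_dir="${2:-/lib/modules}/$release/build"
    if [ -d "$build_dir" ]; then
        echo "headers present for $release"
        return 0
    fi
    echo "no headers for $release (expected $build_dir)"
    return 1
}

# Check the kernel that is running right now; do not abort if headers are missing.
check_kernel_headers "$(uname -r)" || true
```

On Amazon Linux 2 the matching headers come from the kernel-devel package, so rpm -q "kernel-devel-$(uname -r)" is another way to confirm they are installed. Note that this only shows the headers exist for the running kernel; it does not prove that a given driver branch builds against them.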

I also recommend checking the standard Linux driver documentation; it covers additional troubleshooting.

But in general I would suggest taking this up with AWS support, since it might be an issue with the base image they are deploying.

Thanks!