Anaconda kickstart failure Driver 450, CentOS 8.x, CUDA 11

Dear all,

We’re running an automated CentOS 8.1 installation with the NVIDIA CUDA 11 repository configured (http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64) and the nvidia-driver package (nvidia-driver-450.36.06-1.el8.x86_64.rpm) selected for installation.
We see the following error during kickstart installation (Performing an automated installation using Kickstart):

Installing nvidia-driver.x86_64 (508/611)
Installing nvidia-kmod-common.noarch (509/611)                                 

The installation was stopped due to an error which occurred while running in non-interactive cmdline mode. Since there cannot be any questions in cmdline mode,
edit your kickstart file and retry installation.                               
The exact error message is:

Non interactive installation failed: DNF error: Error in POSTIN scriptlet in rpm package nvidia-kmod-common.

The installer will now terminate.

In /tmp/anaconda.log in the crashed installer we see:

07:07:15,589 DBG exception: running handleException
07:07:15,590 CRT exception: Traceback (most recent call last):

  File "/usr/lib64/python3.6/site-packages/pyanaconda/threading.py", line 286, in run
    threading.Thread.run(self)

  File "/usr/lib64/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation.py", line 388, in doInstall
    installation_queue.start()

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 304, in start
    item.start()

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 304, in start
    item.start()

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 472, in start
    self.run_task()

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 438, in run_task
    self._task(*self._task_args, **self._task_kwargs)

  File "/usr/lib64/python3.6/site-packages/pyanaconda/payload/dnfpayload.py", line 1092, in install
    if errors.errorHandler.cb(exc) == errors.ERROR_RAISE:

  File "/usr/lib64/python3.6/site-packages/pyanaconda/errors.py", line 329, in cb
    raise NonInteractiveError("Non interactive installation failed: %s" % exn)

pyanaconda.errors.NonInteractiveError: Non interactive installation failed: DNF error: Error in POSTIN scriptlet in rpm package nvidia-kmod-common

Manual re-install in the /mnt/sysimage/ environment reveals:

[anaconda root@casc-150 ~]# chroot /mnt/sysimage/
[anaconda root@casc-150 /]# rpm --reinstall /tmp/nvidia-kmod-common-450.36.06-1.el8.noarch.rpm                                                                                                                   
warning: /tmp/nvidia-kmod-common-450.36.06-1.el8.noarch.rpm: Header V4 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY
/var/tmp/rpm-tmp.jmxmLy: line 2: /etc/default/grub: No such file or directory

/etc/default/grub is provided by grub2-tools:

[anaconda root@casc-150 /]# rpm -qf /etc/default/grub 
grub2-tools-2.02-78.el8_1.1.x86_64

grub2-tools may not yet be installed or configured at that point in the Anaconda kickstart installation, so the file is still missing, which causes the error.

The solution might be either to require grub2-tools (or /etc/default/grub itself) in the nvidia-kmod-common RPM, or to check for the existence of /etc/default/grub in the postinstall scriptlet.
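The second option, checking for the file before touching it, might look roughly like the sketch below. It is not the packaged scriptlet: the function name add_nouveau_blacklist and its optional path argument are hypothetical, added only so the logic can be tried on a scratch file, and the sketch only covers adding rd.driver.blacklist=nouveau.

```shell
#!/bin/sh
# Sketch only: a guarded update of the GRUB defaults file.
# The optional first argument is a hypothetical override for testing;
# a real scriptlet would operate on /etc/default/grub directly.
add_nouveau_blacklist() {
    grub_defaults="${1:-/etc/default/grub}"
    param="rd.driver.blacklist=nouveau"

    # During a kickstart install the file may not exist yet. Return
    # successfully so the RPM transaction is not aborted.
    [ -f "$grub_defaults" ] || return 0

    # Start from an empty value so a stale variable is never reused.
    GRUB_CMDLINE_LINUX=""
    . "$grub_defaults"

    if [ -z "${GRUB_CMDLINE_LINUX}" ]; then
        # No command line defined yet: append one.
        echo "GRUB_CMDLINE_LINUX=\"${param}\"" >> "$grub_defaults"
    else
        case " ${GRUB_CMDLINE_LINUX} " in
            *" ${param} "*)
                ;;  # parameter already present, nothing to do
            *)
                # Rewrite the existing line with the parameter appended.
                sed -i -e "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"${GRUB_CMDLINE_LINUX} ${param}\"|" "$grub_defaults"
                ;;
        esac
    fi
}
```

With a guard like this, a missing file turns into a no-op instead of a failed POSTIN scriptlet. The kernel argument would then still have to be applied by other means, for example through grubby as the scriptlet already attempts.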

Many thanks in advance!


Same issue with a kickstart install. Any update on whether this has been resolved?

We’re still experiencing this issue with the 470 build on RHEL 8.4. Are there any workarounds or solutions?

Hi all,

there is still a bug in the NVIDIA CUDA packaging, described in the topic “Kickstart Install Fails When Trying to Install Packages from NVIDIA Repository”.

In addition, there is still a second bug, the one reported in the first post of this topic (missing /etc/default/grub).
Fortunately, that one can be worked around with the new %pre-install section in EL8.

As a result, I’ve found a workaround for EL8(.4). The relevant bits of the Anaconda kickstart file are the following:

[...]
repo --name=nvidia --baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
repo --name=epele --baseurl=http://download.fedoraproject.org/pub/epel/8/Everything/x86_64/
repo --name=epelm --baseurl=http://download.fedoraproject.org/pub/epel/8/Modular/x86_64/

%pre-install
mkdir -p /mnt/sysimage/etc/default
touch /mnt/sysimage/etc/default/grub 
%end

%packages --ignoremissing
@^minimal-environment
#kernel-devel
#@nvidia-driver:latest-dkms/ks
@nvidia-driver:latest/ks
cuda-compiler-11-4
cuda-demo-suite-11-4
cuda-libraries-11-4
cuda-libraries-devel-11-4
cuda-toolkit-11-4
cuda-tools-11-4
%end
[...]

As you can see, the first bug is only mitigated, by using one of the nvidia-driver:latest*/ks profiles, which do not install the offending package.
Unfortunately, the package is also pulled in through the following dependency chain:
cuda -> cuda-11-4 -> cuda-runtime-11-4

[root@localhost ~]# LANG=C dnf deplist cuda-runtime-11-4
Updating Subscription Management repositories.
Last metadata expiration check: 0:14:34 ago on Tue Jul 20 12:48:20 2021.
package: cuda-runtime-11-4-11.4.0-1.x86_64
  dependency: cuda-drivers >= 470.42.01
   provider: cuda-drivers-470.42.01-1.x86_64
  dependency: cuda-libraries-11-4 >= 11.4.0
   provider: cuda-libraries-11-4-11.4.0-1.x86_64

This is why the kickstart file does not install the cuda meta-package directly, but instead selects the individual cuda-*-11-4 sub-packages, leaving out cuda-drivers.
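To check other meta-packages for the same dependency before adding them to a kickstart file, the deplist output can be scanned for it. The following is a small sketch; the function name deplist_requires is made up for illustration, and it assumes the "dependency: <name> ..." line format shown in the output above.

```shell
#!/bin/sh
# Sketch: read `dnf deplist` output on stdin and succeed (exit 0) if the
# named package appears as a dependency. The name and interface are
# illustrative only.
deplist_requires() {
    wanted="$1"
    # Dependency lines look like: "  dependency: cuda-drivers >= 470.42.01"
    awk -v want="$wanted" '
        $1 == "dependency:" && $2 == want { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Possible usage on a system with the NVIDIA repository enabled:
#   LANG=C dnf deplist cuda-runtime-11-4 | deplist_requires cuda-drivers \
#       && echo "cuda-runtime-11-4 pulls in cuda-drivers"
```

This only inspects direct dependencies of the queried package, so each link of a chain such as the one above has to be checked separately.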

NVIDIA’s software engineers would need to fix both issues:

  1. Use Lua instead of /bin/sh in the %pretrans scriptlet of cuda-drivers
  2. Stop sourcing /etc/default/grub unconditionally in nvidia-kmod-common

Regarding post #1, changes to nvidia-kmod-common were made last July:

Regarding comment #4, see the kickstart section in the CUDA Installation Guide: Installation Guide Linux :: CUDA Toolkit Documentation

The ks (kickstart) modularity profile is the same as the default, except without the cuda-drivers meta-package.

The cuda-drivers meta-package has a %pretrans that calls a script to remove the NVIDIA driver runfile if present.

Finally, yes the cuda meta-package installs the driver too.

Instead please install the cuda-toolkit-X-Y meta-package, for example:

%packages
@^Minimal Install
@nvidia-driver:latest/ks
cuda-toolkit-11-4
%end

Thank you for the fast response but I’ve some remarks:

Regarding the change of the nvidia-kmod-common.spec:
Yes, I see the change in yum-packaging-nvidia-kmod-common/nvidia-kmod-common.spec at rhel8 · NVIDIA/yum-packaging-nvidia-kmod-common · GitHub
but not in the actual package:

[root@localhost ~]# dnf download nvidia-kmod-common
Updating Subscription Management repositories.
Last metadata expiration check: 0:01:13 ago on Wed 21 Jul 2021 10:21:13 AM CEST.
nvidia-kmod-common-470.57.02-1.el8.noarch.rpm                                                             273 kB/s |  10 kB     00:00    
[root@localhost ~]# rpm -qp --scripts nvidia-kmod-common-470.57.02-1.el8.noarch.rpm 
postinstall scriptlet (using /bin/sh):
/usr/sbin/grubby --update-kernel=ALL --args='rd.driver.blacklist=nouveau' --remove-args='nomodeset gfxpayload=vga=normal nouveau.modeset=0 nvidia-drm.modeset=1' &>/dev/null
. /etc/default/grub
if [ -z "${GRUB_CMDLINE_LINUX}" ]; then
  echo GRUB_CMDLINE_LINUX="rd.driver.blacklist=nouveau" >> /etc/default/grub
else
  for param in rd.driver.blacklist=nouveau; do
    echo ${GRUB_CMDLINE_LINUX} | grep -q $param
    [ $? -eq 1 ] && GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} ${param}"
  done
  for param in nomodeset gfxpayload=vga=normal nouveau.modeset=0 nvidia-drm.modeset=1; do
    echo ${GRUB_CMDLINE_LINUX} | grep -q $param
    [ $? -eq 0 ] && GRUB_CMDLINE_LINUX="$(echo ${GRUB_CMDLINE_LINUX} | sed -e "s/$param//g")"
  done
  sed -i -e "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"${GRUB_CMDLINE_LINUX}\"|g" /etc/default/grub
fi
preuninstall scriptlet (using /bin/sh):
if [ "$1" -eq "0" ]; then
  /usr/sbin/grubby --update-kernel=ALL --remove-args='rd.driver.blacklist=nouveau' &>/dev/null
  for param in rd.driver.blacklist=nouveau; do
    echo ${GRUB_CMDLINE_LINUX} | grep -q $param
    [ $? -eq 0 ] && GRUB_CMDLINE_LINUX="$(echo ${GRUB_CMDLINE_LINUX} | sed -e "s/$param//g")"
  done
  sed -i -e "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"${GRUB_CMDLINE_LINUX}\"|g" /etc/default/grub
fi ||:

if [ $1 -eq 0 ] ; then 
        # Package removal, not upgrade 
        systemctl --no-reload disable --now nvidia-fallback.service &>/dev/null || : 
fi
postuninstall program: /bin/sh

Regarding cuda-drivers:
Unfortunately, I don’t understand what prevents fixing the package’s %pretrans scriptlet so that it works under kickstart, rather than working around the problem. With such a fix, the ks and default profiles would be identical and the separate ks profile could be dropped entirely. Additionally, at the moment no one is able to kickstart an SXM-based system such as an HGX machine, because the fm profile is broken too.
How can one contribute a fix for the cuda-drivers %pretrans script?

Regarding installing meta-packages of the cuda toolchain:

I think that is not ideal. One would like to install plain cuda and pick up newer versions simply by updating the locally mirrored repository (we are talking about automated kickstart installations here, e.g. of clusters), not by modifying the kickstart file (or package list) every time a new release appears.

Hi @rene.oertel
Thank you for pointing that out. It appears that there are some git commits for nvidia-kmod-common, that were not merged into the 470 branch.

Yes, it is unfortunate that cuda-drivers package needs to first try to remove the NVIDIA driver runfile. Various attempts were made to use Lua instead but it was not possible to satisfy both scenarios.

The cuda package depends on cuda-drivers which will install the latest appropriate NVIDIA driver packages per distro. This would install the :latest-dkms stream, which I think is not what you are after. Perhaps a branch-less cuda-toolkit package would satisfy this requirement?

For nvidia-kmod-common: Is there a schedule for when these changes will be merged into the public (CUDA) driver release?

I just want to install cuda in kickstart/other automated first time installation scenarios.

The cuda-drivers package has to be fixed using Lua. I think the script embedded in the RPM, which calls the uninstaller, should be fixable without modifying the uninstaller code that removes the NVIDIA driver runfile, since it is uncommon, if not nearly impossible, for the runfile NVIDIA driver to exist in a kickstart environment.

Hi @rene.oertel
Are you still seeing this issue?

I am having a similar problem because cuda-license-10-1 (and cuda-license-10-2) invokes “cat” in the postinstall scriptlet, but the package does not include a dependency on /bin/cat (coreutils). So the Kickstart install fails.

The cuda-license-X-Y package is removed in CUDA 11.0 and newer. Is it possible to use a newer toolkit version?

@kmittman I’ll try to test it again with a real installation this week, but a colleague of mine still had the same issue about two weeks ago. I’ll verify the actual environment, as local legacy files or yum repositories sometimes cause additional trouble.

@kmittman Thank you for your reply. I misidentified the cause, but I am still having the problem. Let me explain…

 $ dnf --repofrompath=NVIDIA,https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64 --repo=NVIDIA repoquery --requires --resolve nvidia-driver-cuda-510.47.03
Added NVIDIA repo from https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
Last metadata expiration check: 0:04:15 ago on Tue 19 Apr 2022 01:00:53 PM PDT.
cuda-nvml-dev-10-1-0:10.1.243-1.x86_64
cuda-nvml-dev-10-2-0:10.2.89-1.x86_64
cuda-nvml-devel-11-0-0:11.0.167-1.x86_64
nvidia-driver-NVML-3:510.47.03-1.el8.x86_64
nvidia-driver-cuda-libs-3:510.47.03-1.el8.x86_64
nvidia-persistenced-3:510.47.03-1.el8.x86_64

What’s happening is that nvidia-driver-cuda requires libnvidia-ml.so.1()(64bit), which is provided by several packages, including some very old ones (like cuda-nvml-dev). These, in turn, require cuda-license-X-Y, resulting in the failure I am seeing. Anaconda gets confused and it tries to install the older packages, at least for me using this repository.

I do not know enough about RPM (or Anaconda) to know how to fix this. Maybe some Obsoletes or Epoch header in the RPM spec?

For now I will just remove the unwanted packages from my local mirror. But it would be nice to fix this.
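A mirror cleanup along those lines could start by listing the candidate files. The sketch below is only an illustration: the function name list_legacy_cuda_rpms is hypothetical, and the two filename patterns are assumptions derived from the package names mentioned in this thread (cuda-license-X-Y and the 10.x cuda-nvml-dev packages), so they should be reviewed against the actual mirror.

```shell
#!/bin/sh
# Sketch: print RPM files in a local mirror directory that belong to the
# old package families named in this thread. The patterns are assumptions.
list_legacy_cuda_rpms() {
    mirror_dir="$1"
    find "$mirror_dir" -type f \
        \( -name 'cuda-license-*.rpm' -o -name 'cuda-nvml-dev-10-*.rpm' \) \
        | sort
}

# Possible usage (the path is a placeholder):
#   list_legacy_cuda_rpms /srv/mirror/cuda-rhel8
```

The function only lists files and deletes nothing. After removing packages from a mirror, the repository metadata has to be regenerated (for example with createrepo_c), otherwise the installer will still see the old entries.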

The datacenter-gpu-manager also requires /bin/sh in its pretrans scriptlet, even though the scriptlet is empty… So I cannot install this useful package with a Kickstart.

Any chance of convincing the DCGM folks to fix this?
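To find out in advance which packages in a mirror are affected by this, the scriptlet listing of each RPM can be scanned. The sketch below reads `rpm -qp --scripts` output on stdin; the function name pretrans_uses_sh is hypothetical, and it assumes that pretrans entries are labelled "pretrans scriptlet (using ...)" or "pretrans program: ...", following the same pattern as the postinstall and postuninstall lines quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if `rpm -qp --scripts` output on stdin declares
# a pretrans scriptlet that is run by /bin/sh. The label format is an
# assumption based on the other scriptlet headers shown in this thread.
pretrans_uses_sh() {
    grep -Eq '^pretrans (scriptlet \(using|program:) */bin/sh'
}

# Possible usage (the file name is a placeholder):
#   rpm -qp --scripts some-package.rpm | pretrans_uses_sh \
#       && echo "some-package.rpm needs /bin/sh before the transaction"
```

An empty scriptlet body still shows up as a "program:" line, which matches the DCGM case described above.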

RE: DCGM pinging @nkonyuchenko

Hello @lopresti,

We know about this problem and will fix it in the upcoming DCGM 2.4.x release.

WBR,
Nik