Anaconda kickstart failure Driver 450, CentOS 8.x, CUDA 11

rene.oertel · June 18, 2020, 10:59am

Dear all,

we’re running an automated CentOS 8.1 installation with the NVIDIA CUDA 11 repository configured (http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64) and the nvidia-driver package (nvidia-driver-450.36.06-1.el8.x86_64.rpm) selected for installation.
We see the following error during kickstart installation (Performing an automated installation using Kickstart):

Installing nvidia-driver.x86_64 (508/611)
Installing nvidia-kmod-common.noarch (509/611)                                 

The installation was stopped due to an error which occurred while running in non-interactive cmdline mode. Since there cannot be any questions in cmdline mode,
edit your kickstart file and retry installation.                               
The exact error message is:

Non interactive installation failed: DNF error: Error in POSTIN scriptlet in rpm package nvidia-kmod-common.

The installer will now terminate.

In /tmp/anaconda.log in the crashed installer we see:

07:07:15,589 DBG exception: running handleException
07:07:15,590 CRT exception: Traceback (most recent call last):

  File "/usr/lib64/python3.6/site-packages/pyanaconda/threading.py", line 286, in run
    threading.Thread.run(self)

  File "/usr/lib64/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation.py", line 388, in doInstall
    installation_queue.start()

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 304, in start
    item.start()

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 304, in start
    item.start()

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 472, in start
    self.run_task()

  File "/usr/lib64/python3.6/site-packages/pyanaconda/installation_tasks.py", line 438, in run_task
    self._task(*self._task_args, **self._task_kwargs)

  File "/usr/lib64/python3.6/site-packages/pyanaconda/payload/dnfpayload.py", line 1092, in install
    if errors.errorHandler.cb(exc) == errors.ERROR_RAISE:

  File "/usr/lib64/python3.6/site-packages/pyanaconda/errors.py", line 329, in cb
    raise NonInteractiveError("Non interactive installation failed: %s" % exn)

pyanaconda.errors.NonInteractiveError: Non interactive installation failed: DNF error: Error in POSTIN scriptlet in rpm package nvidia-kmod-common

Manual re-install in the /mnt/sysimage/ environment reveals:

[anaconda root@casc-150 ~]# chroot /mnt/sysimage/
[anaconda root@casc-150 /]# rpm --reinstall /tmp/nvidia-kmod-common-450.36.06-1.el8.noarch.rpm                                                                                                                   
warning: /tmp/nvidia-kmod-common-450.36.06-1.el8.noarch.rpm: Header V4 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY
/var/tmp/rpm-tmp.jmxmLy: line 2: /etc/default/grub: No such file or directory

/etc/default/grub is provided by grub2-tools:

[anaconda root@casc-150 /]# rpm -qf /etc/default/grub 
grub2-tools-2.02-78.el8_1.1.x86_64

grub2-tools may not yet be installed or configured at that point of the Anaconda kickstart installation, so the file is still missing, which results in the error.

The solution might be either to require grub2-tools or /etc/default/grub in the nvidia-kmod-common RPM, or to check for the existence of /etc/default/grub in the postinstall script.
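
The existence check could look roughly like the sketch below. It is only an illustration of the idea, not the actual NVIDIA scriptlet: the function name add_nouveau_blacklist and the GRUB_DEFAULT_FILE override are made up here, and it assumes the GRUB_CMDLINE_LINUX value is double-quoted, as in a stock EL8 file.

```shell
# Hypothetical guard for the postinstall scriptlet. The function name and the
# GRUB_DEFAULT_FILE override are illustrative, not part of the NVIDIA package.
add_nouveau_blacklist() {
  grub_file="${GRUB_DEFAULT_FILE:-/etc/default/grub}"
  param="rd.driver.blacklist=nouveau"

  # Skip quietly when the file does not exist yet, as in the Anaconda chroot.
  [ -f "$grub_file" ] || return 0

  # Parameter already present: nothing to change.
  if grep -q "^GRUB_CMDLINE_LINUX=.*${param}" "$grub_file"; then
    return 0
  fi

  if grep -q '^GRUB_CMDLINE_LINUX="' "$grub_file"; then
    # Append the parameter inside the existing double-quoted value.
    sed -i -e "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 ${param}\"|" "$grub_file"
  else
    # No double-quoted GRUB_CMDLINE_LINUX line found: add one.
    echo "GRUB_CMDLINE_LINUX=\"${param}\"" >> "$grub_file"
  fi
}
```

With a guard like this the scriptlet returns success when the file is absent, so the RPM transaction would not abort; the boot parameter would then have to be set by other means once the bootloader is configured.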

Many thanks in advance!

jte888 · December 8, 2020, 11:39pm

Same issue with ks install, any updates on if resolved?

b-c-c · July 15, 2021, 6:33pm

We’re still experiencing this issue with the 470 build on RHEL 8.4. Are there any workarounds or solutions?

rene.oertel · July 20, 2021, 11:16am

Hi all,

there is still a bug in the NVIDIA CUDA packaging, described in the topic “Kickstart Install Fails When Trying to Install Packages from NVIDIA Repository”.

Additionally, there is still a second bug, the one from the first post in this topic (missing /etc/default/grub).
Fortunately, one can overcome this initial issue with the new %pre-install section in EL8.

As a result, I’ve found a workaround for EL8(.4). The relevant bits of the Anaconda kickstart file are the following:

[...]
repo --name=nvidia --baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
repo --name=epele --baseurl=http://download.fedoraproject.org/pub/epel/8/Everything/x86_64/
repo --name=epelm --baseurl=http://download.fedoraproject.org/pub/epel/8/Modular/x86_64/

%pre-install
mkdir -p /mnt/sysimage/etc/default
touch /mnt/sysimage/etc/default/grub 
%end

%packages --ignoremissing
@^minimal-environment
#kernel-devel
#@nvidia-driver:latest-dkms/ks
@nvidia-driver:latest/ks
cuda-compiler-11-4
cuda-demo-suite-11-4
cuda-libraries-11-4
cuda-libraries-dev-11-4
cuda-toolkit-11-4
cuda-tools-11-4
%end
[...]

As you can see, the error from the bug above is only mitigated by not installing the offending package, which is what the nvidia-driver:latest*/ks profiles do.
Unfortunately, the package also gets installed through the following dependency chain:
cuda → cuda-11-4 → cuda-runtime-11-4:

[root@localhost ~]# LANG=C dnf deplist cuda-runtime-11-4
Updating Subscription Management repositories.
Last metadata expiration check: 0:14:34 ago on Tue Jul 20 12:48:20 2021.
package: cuda-runtime-11-4-11.4.0-1.x86_64
  dependency: cuda-drivers >= 470.42.01
   provider: cuda-drivers-470.42.01-1.x86_64
  dependency: cuda-libraries-11-4 >= 11.4.0
   provider: cuda-libraries-11-4-11.4.0-1.x86_64

This is why the cuda meta-package should not be installed directly during kickstart; instead, select the cuda-*-11-4 sub-packages individually, without cuda-drivers.
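
To check other packages for the same problem before putting them into the kickstart file, something like the following sketch might help. It only filters the output of dnf repoquery --requires --resolve; the function name requires_cuda_drivers is made up, and the pattern assumes the resolved requirement is printed as a full package name starting with cuda-drivers- followed by a digit (epoch or version).

```shell
# Reads `dnf repoquery --requires --resolve <pkg>` output on stdin and succeeds
# if the cuda-drivers meta-package is among the resolved requirements.
# The function name is hypothetical; the name pattern is an assumption.
requires_cuda_drivers() {
  grep -Eq '^cuda-drivers-[0-9]'
}

# Possible usage on an installed system with the NVIDIA repository enabled:
#   if dnf repoquery --requires --resolve cuda-runtime-11-4 | requires_cuda_drivers; then
#     echo "cuda-runtime-11-4 pulls in cuda-drivers"
#   fi
```

This only looks at direct requirements, so each package in a chain like the one above has to be checked separately.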

One has to reach out to NVIDIA software engineers to fix both issues:

Use Lua instead of /bin/sh in the %pretrans script of cuda-drivers
Avoid the unconditional use of /etc/default/grub in nvidia-kmod-common

kmittman · July 20, 2021, 10:52pm

Regarding post #1, changes to nvidia-kmod-common were made last July:

Skip postinstall if detect running in Anaconda / kickstart · NVIDIA/yum-packaging-nvidia-kmod-common@2728e48 · GitHub
Check that grub config exists · NVIDIA/yum-packaging-nvidia-kmod-common@8bb9580 · GitHub

Regarding comment #4, see the kickstart section in the CUDA Installation Guide: Installation Guide Linux :: CUDA Toolkit Documentation

The ks (kickstart) modularity profile is the same as the default, except without the cuda-drivers meta-package.

The cuda-drivers meta-package has a %pretrans that calls a script to remove the NVIDIA driver runfile if present.

Finally, yes the cuda meta-package installs the driver too.

Instead please install the cuda-toolkit-X-Y meta-package, for example:

%packages
@^Minimal Install
@nvidia-driver:latest/ks
cuda-toolkit-11-4
%end

rene.oertel · July 21, 2021, 9:12am

Thank you for the fast response but I’ve some remarks:

Regarding the change of the nvidia-kmod-common.spec:
Yes, I see the change in yum-packaging-nvidia-kmod-common/nvidia-kmod-common.spec at rhel8 · NVIDIA/yum-packaging-nvidia-kmod-common · GitHub
but not in the actual package:

[root@localhost ~]# dnf download nvidia-kmod-common
Updating Subscription Management repositories.
Last metadata expiration check: 0:01:13 ago on Wed 21 Jul 2021 10:21:13 AM CEST.
nvidia-kmod-common-470.57.02-1.el8.noarch.rpm                                                             273 kB/s |  10 kB     00:00    
[root@localhost ~]# rpm -qp --scripts nvidia-kmod-common-470.57.02-1.el8.noarch.rpm 
postinstall scriptlet (using /bin/sh):
/usr/sbin/grubby --update-kernel=ALL --args='rd.driver.blacklist=nouveau' --remove-args='nomodeset gfxpayload=vga=normal nouveau.modeset=0 nvidia-drm.modeset=1' &>/dev/null
. /etc/default/grub
if [ -z "${GRUB_CMDLINE_LINUX}" ]; then
  echo GRUB_CMDLINE_LINUX="rd.driver.blacklist=nouveau" >> /etc/default/grub
else
  for param in rd.driver.blacklist=nouveau; do
    echo ${GRUB_CMDLINE_LINUX} | grep -q $param
    [ $? -eq 1 ] && GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} ${param}"
  done
  for param in nomodeset gfxpayload=vga=normal nouveau.modeset=0 nvidia-drm.modeset=1; do
    echo ${GRUB_CMDLINE_LINUX} | grep -q $param
    [ $? -eq 0 ] && GRUB_CMDLINE_LINUX="$(echo ${GRUB_CMDLINE_LINUX} | sed -e "s/$param//g")"
  done
  sed -i -e "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"${GRUB_CMDLINE_LINUX}\"|g" /etc/default/grub
fi
preuninstall scriptlet (using /bin/sh):
if [ "$1" -eq "0" ]; then
  /usr/sbin/grubby --update-kernel=ALL --remove-args='rd.driver.blacklist=nouveau' &>/dev/null
  for param in rd.driver.blacklist=nouveau; do
    echo ${GRUB_CMDLINE_LINUX} | grep -q $param
    [ $? -eq 0 ] && GRUB_CMDLINE_LINUX="$(echo ${GRUB_CMDLINE_LINUX} | sed -e "s/$param//g")"
  done
  sed -i -e "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"${GRUB_CMDLINE_LINUX}\"|g" /etc/default/grub
fi ||:

if [ $1 -eq 0 ] ; then 
        # Package removal, not upgrade 
        systemctl --no-reload disable --now nvidia-fallback.service &>/dev/null || : 
fi
postuninstall program: /bin/sh
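
For checking a mirrored package automatically, a rough heuristic along these lines could be used. It is only a sketch: the function name sources_grub_unguarded is made up, and it assumes a fixed scriptlet would use a plain test such as [ -f /etc/default/grub ] before sourcing the file.

```shell
# Reads `rpm -qp --scripts <rpm>` output on stdin. Succeeds only if the text
# sources /etc/default/grub and contains no existence test for that file.
# Heuristic sketch; the function name is hypothetical.
sources_grub_unguarded() {
  scripts=$(cat)

  # No ". /etc/default/grub" line at all: nothing to complain about.
  printf '%s\n' "$scripts" | grep -Eq '^[[:space:]]*\. /etc/default/grub' || return 1

  # An existence test ([ -e|-f|-s FILE ] or test ...) counts as a guard.
  if printf '%s\n' "$scripts" | grep -Eq '(\[|test)[[:space:]]+-[efs][[:space:]]+"?/etc/default/grub'; then
    return 1
  fi
  return 0
}

# Possible usage:
#   rpm -qp --scripts nvidia-kmod-common-470.57.02-1.el8.noarch.rpm | sources_grub_unguarded \
#     && echo "scriptlet still sources /etc/default/grub unconditionally"
```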

Regarding cuda-drivers:
Unfortunately, I don’t understand what prevents fixing the package’s %pretrans so that it works in kickstart, instead of working around the problem. The ks and default profiles are otherwise the same, so a separate ks profile could be avoided entirely. Additionally, at the moment no one is able to kickstart an SXM-based system like an HGX machine, as the fm profile is broken, too.
How can one contribute a fix for the cuda-drivers %pretrans script?

Regarding installing meta-packages of the cuda toolchain:

I think that is not ideal. One would like to install just plain cuda and get newer versions simply by updating the locally mirrored repository (we are talking about automated kickstart installations here, e.g. of clusters), not by modifying the kickstart file (or package list) every time a new release comes out.

kmittman · July 26, 2021, 8:59pm

Hi @rene.oertel
Thank you for pointing that out. It appears that some git commits for nvidia-kmod-common were not merged into the 470 branch.

Yes, it is unfortunate that cuda-drivers package needs to first try to remove the NVIDIA driver runfile. Various attempts were made to use Lua instead but it was not possible to satisfy both scenarios.

The cuda package depends on cuda-drivers which will install the latest appropriate NVIDIA driver packages per distro. This would install the :latest-dkms stream, which I think is not what you are after. Perhaps a branch-less cuda-toolkit package would satisfy this requirement?

rene.oertel · July 29, 2021, 7:02am

For nvidia-kmod-common: Is there any schedule for when these changes get merged into the public (CUDA) driver release?

I just want to install cuda in kickstart/other automated first time installation scenarios.

We have to fix the cuda-drivers package with Lua. I think the script embedded in the RPM, which calls the uninstaller, should be fixable without modifying the uninstaller code that removes the NVIDIA driver runfile, since it is uncommon, not to say nearly impossible, for the runfile NVIDIA driver to exist in kickstart environments.

kmittman · April 4, 2022, 7:50pm

Hi @rene.oertel
Are you still seeing this issue?

lopresti · April 19, 2022, 6:16pm

I am having a similar problem because cuda-license-10-1 (and cuda-license-10-2) invokes “cat” in the postinstall scriptlet, but the package does not include a dependency on /bin/cat (coreutils). So the Kickstart install fails.

kmittman · April 19, 2022, 6:26pm

The cuda-license-X-Y package is removed in CUDA 11.0 and newer. Is it possible to use a newer toolkit version?

rene.oertel · April 19, 2022, 6:34pm

@kmittman I’ll try to test it again with an actual installation this week, but a colleague of mine still had the same issue about two weeks ago. I’ll verify the actual environment, as sometimes local legacy files or yum repositories cause additional trouble.

lopresti · April 19, 2022, 8:14pm

@kmittman Thank you for your reply. I misidentified the cause, but I am still having the problem. Let me explain…

 $ dnf --repofrompath=NVIDIA,https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64 --repo=NVIDIA repoquery --requires --resolve nvidia-driver-cuda-510.47.03
Added NVIDIA repo from https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
Last metadata expiration check: 0:04:15 ago on Tue 19 Apr 2022 01:00:53 PM PDT.
cuda-nvml-dev-10-1-0:10.1.243-1.x86_64
cuda-nvml-dev-10-2-0:10.2.89-1.x86_64
cuda-nvml-devel-11-0-0:11.0.167-1.x86_64
nvidia-driver-NVML-3:510.47.03-1.el8.x86_64
nvidia-driver-cuda-libs-3:510.47.03-1.el8.x86_64
nvidia-persistenced-3:510.47.03-1.el8.x86_64

What’s happening is that nvidia-driver-cuda requires libnvidia-ml.so.1()(64bit), which is provided by several packages, including some very old ones (like cuda-nvml-dev). These, in turn, require cuda-license-X-Y, resulting in the failure I am seeing. Anaconda gets confused and it tries to install the older packages, at least for me using this repository.

I do not know enough about RPM (or Anaconda) to know how to fix this. Maybe some Obsoletes or Epoch header in the RPM spec?

For now I will just remove the unwanted packages from my local mirror. But it would be nice to fix this.
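
To find the candidates for removal, the repoquery output above can be filtered with a small helper like this sketch. The function name legacy_nvml_providers and the cuda-nvml-dev-10-* pattern are assumptions based only on the package names listed above; other old packages may need their own pattern.

```shell
# Reads resolved provider names (one per line) on stdin and prints those that
# look like the old CUDA 10.x NVML development packages.
# The function name and the name pattern are assumptions for this sketch.
legacy_nvml_providers() {
  # `|| true` keeps the exit status at 0 when nothing matches.
  grep -E '^cuda-nvml-dev-10-[0-9]+-' || true
}

# Possible usage, with the same repoquery call as above:
#   dnf --repofrompath=NVIDIA,https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64 \
#       --repo=NVIDIA repoquery --requires --resolve nvidia-driver-cuda-510.47.03 \
#     | legacy_nvml_providers
```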

lopresti · April 19, 2022, 10:42pm

The datacenter-gpu-manager also requires /bin/sh in its pretrans scriptlet, even though the scriptlet is empty… So I cannot install this useful package with a Kickstart.

Any chance of convincing the DCGM folks to fix this?
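
For anyone who wants to check their mirrored packages for the same pattern, a sketch like the following might do. The function name pretrans_needs_sh is made up, and the two header formats it matches are assumed to be the same ones rpm prints for other scriptlets ("... scriptlet (using /bin/sh):" and "... program: /bin/sh").

```shell
# Reads `rpm -qp --scripts <rpm>` output on stdin and succeeds if a pretrans
# scriptlet is declared with /bin/sh as interpreter, even when its body is empty.
# The function name is hypothetical; the header formats are an assumption.
pretrans_needs_sh() {
  grep -Eq '^pretrans (scriptlet \(using /bin/sh ?\):|program: /bin/sh)'
}

# Possible usage (the file name is a placeholder):
#   rpm -qp --scripts some-package.rpm | pretrans_needs_sh \
#     && echo "pretrans needs /bin/sh, may fail during kickstart"
```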

kmittman · April 21, 2022, 3:59pm

RE: DCGM pinging @nkonyuchenko

nkonyuchenko · April 21, 2022, 5:50pm

Hello @lopresti,

We know about this problem and will fix it in the upcoming DCGM 2.4.x release.

WBR,
Nik
