Root Cause Analysis for Nvidia Driver >= 560 install failure on Ubuntu 22.04

Summary

A few weeks ago, I noticed that Nvidia drivers >= 560 fail to install on Ubuntu 22.04, while they install fine on Ubuntu 24.04. To reproduce, launch an Ubuntu 22.04 LTS server on a g6.xlarge instance (Nvidia L4 GPU) in AWS EC2 and install CUDA 12.6 with the example script below, which implicitly installs Nvidia driver version 560.

CUDA Install Script

#!/bin/bash -xe
. /etc/os-release

distro=ubuntu${VERSION_ID//[.]/""}
arch="x86_64"
echo "Ubuntu  $distro/$arch"

export DEBIAN_FRONTEND=noninteractive
export DEBCONF_NONINTERACTIVE_SEEN=true

CUDA=12.6
CUDA_DASH=${CUDA//\./-}

[[ ! -z $(lspci -v | grep NVIDIA) ]] && \
[[ ! -x "$(command -v nvidia-smi)" ]] && \
apt-get -y install linux-headers-$(uname -r) && \
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.1-1_all.deb && \
dpkg -i cuda-keyring_1.1-1_all.deb && \
apt-get update && apt-get -y purge cuda && apt-get -y purge nvidia-* && apt-get -y purge libnvidia-* && apt-get -y autoremove && \
apt-get -y install cuda-${CUDA_DASH} && \
apt-get -y install libcudnn8 && \
apt-get -y install libcudnn8-dev && \
echo "export PATH=/usr/local/cuda-${CUDA}/bin:\$PATH" >> /home/ubuntu/.bashrc && \
CUDA_COMPAT=$(nvidia-smi | grep CUDA | awk '{print $(NF - 1)}') && \
CUDA_COMPAT_DASH=${CUDA_COMPAT//\./-} && \
apt-get -y install cuda-compat-${CUDA_COMPAT_DASH} && \
echo "export LD_LIBRARY_PATH=/usr/local/cuda-${CUDA_COMPAT}/compat:/usr/local/cuda-${CUDA}/lib64:\$LD_LIBRARY_PATH" >> /home/ubuntu/.bashrc && \
reboot

Failure

The script above fails to install the Nvidia driver and, as a result, CUDA 12.6. The failure shows up as an Nvidia package install failure caused by a Make build error in an Nvidia module.
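To see the underlying compiler error rather than just the package failure, it can help to search the module build logs. The sketch below assumes the Nvidia modules are built through DKMS and that the build logs are named make.log under /var/lib/dkms/nvidia; both are assumptions not confirmed in this write-up, and show_build_errors is a hypothetical helper name.

```shell
# Print every "error:" line from make.log files under a build tree.
# Assumption: DKMS keeps the Nvidia build logs under /var/lib/dkms/nvidia.
show_build_errors() {
    log_root="${1:-/var/lib/dkms/nvidia}"
    # grep -H prefixes each match with the log file it came from.
    find "$log_root" -name make.log -exec grep -H 'error:' {} + 2>/dev/null
}
```

Calling show_build_errors with no argument searches the assumed default location; pass a different directory if the logs live elsewhere on your system.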

Root Cause Analysis

I did a root cause analysis of this failure and am sharing it so others may benefit:

  • First, installing CUDA 12.6 on Ubuntu 22.04 implicitly installs the gcc-11 packages.
  • Second, gcc-11 causes an error in an Nvidia module build: cc: error: unrecognized command-line option ‘-ftrivial-auto-var-init=zero’. GCC only added this option in version 12.
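One way to confirm which installed compilers reject the option is to probe them directly. This is a minimal sketch, not part of the original analysis: compiler_accepts_flag is a hypothetical helper that syntax-checks a trivial C program with the given flag and reports success or failure through its exit status.

```shell
# Succeeds if compiler $1 accepts flag $2; fails otherwise.
# It feeds a trivial C program on stdin and only syntax-checks it,
# so no object file is written.
compiler_accepts_flag() {
    printf 'int main(void) { return 0; }\n' |
        "$1" "$2" -fsyntax-only -x c - >/dev/null 2>&1
}
```

For example, compiler_accepts_flag gcc-11 -ftrivial-auto-var-init=zero should fail, while the same call with gcc-12 should succeed, provided both compilers are installed.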

Suggested Fix

The fix is to build with gcc-12 or higher. The question is how to make that happen.

I first tried simply installing gcc-12 before installing CUDA 12.6, but that is not enough: installing CUDA 12.6 installs gcc-11 anyway and makes it the default, leading to the build failure described above.

The solution I found is to pre-install every gcc-11 and gcc-12 related package, set gcc-12 as the default, and only then install CUDA 12.6 on Ubuntu 22.04. If you leave out any gcc-related package, the solution fails. Here is what it looks like:

# Ubuntu 22.04 only: pre-install the gcc-11 and gcc-12 toolchains, then make gcc-12 the default.
# The outer "( ... || : )" always succeeds, so it can sit inside the && chain of the install script.
( ( [[ "$VERSION_ID" == 22.04* ]] && \
    apt-get -y install build-essential gcc g++ \
        cpp-11 gcc-11 g++-11 gcc-11-base libgcc-11-dev libstdc++-11-dev \
        cpp-12 gcc-12 g++-12 libgcc-12-dev libstdc++-12-dev && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 50 --slave /usr/bin/g++ g++ /usr/bin/g++-11 && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100 --slave /usr/bin/g++ g++ /usr/bin/g++-12 ) || : )
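Because the command above never reports failure, it may be worth checking that the default gcc really is version 12 or newer before going on to install CUDA. The following is a sketch under that assumption; gcc_major_at_least is a hypothetical helper, and it expects a version string such as the one printed by gcc -dumpversion.

```shell
# Succeeds only if version string $1 (e.g. "12" or "12.3.0")
# has a major version of at least $2.
gcc_major_at_least() {
    major="${1%%.*}"    # keep everything before the first dot
    case "$major" in
        ''|*[!0-9]*) return 1 ;;    # empty or non-numeric: treat as too old
    esac
    [ "$major" -ge "$2" ]
}

# Possible use before the CUDA install step:
# gcc_major_at_least "$(gcc -dumpversion)" 12 || { echo "default gcc is older than 12" >&2; exit 1; }
```

The check takes the version string as an argument rather than calling gcc itself, so it can be exercised without any particular compiler installed.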

There may be a better way to fix this, and I would love to hear about it. Of course, it would be ideal if Nvidia CUDA installed compatible gcc-12 or higher packages on Ubuntu 22.04 instead of automatically installing gcc-11.