Signing Nvidia Drivers

Hi all,
I’m working on a single-node Kubernetes cluster running RKE2 on Ubuntu 23, and I want to use signed NVIDIA drivers deployed through the GPU Operator Helm chart.
The driver version we currently use is 535.129.03.
Previously, on Ubuntu 22.04, we used this image:

nvcr.io/nvidia/driver:535.129.03-ubuntu22.04

Now, on Ubuntu 23, we can compile the drivers on the fly and sign the NVIDIA kernel module like this:

sh NVIDIA-Linux-x86_64-535.129.03.run -x \
--module-signing-secret-key=/path/to/signing.key \
--module-signing-public-key=/path/to/signing.x509

This is based on the NVIDIA README:
https://download.nvidia.com/XFree86/Linux-x86/361.45.11/README/installdriver.html#modulesigning

However, in the _create_driver_package() function, signing the driver kernel modules with the ‘donkey’ tool fails.

This is the snippet from the source code:

if [ -n "${PRIVATE_KEY}" ]; then
  echo "Signing NVIDIA driver kernel modules..."
  donkey get "${PRIVATE_KEY}" sh -c "PATH=${PATH}:/usr/src/linux-headers-${KERNEL_VERSION}/scripts && \
    sign-file sha512 \$DONKEY_FILE pubkey.x509 nvidia.ko nvidia.ko.sign && \
    sign-file sha512 \$DONKEY_FILE pubkey.x509 nvidia-modeset.ko nvidia-modeset.ko.sign && \
    sign-file sha512 \$DONKEY_FILE pubkey.x509 nvidia-uvm.ko"
  nvidia_sign_args="--linked-module nvidia.ko --signed-module nvidia.ko.sign"
  nvidia_modeset_sign_args="--linked-module nvidia-modeset.ko --signed-module nvidia-modeset.ko.sign"
  nvidia_uvm_sign_args="--signed"
fi

The process always stops here with:

"donkey: could not establish connection: Connection refused"

At the end of the NVIDIA driver Dockerfile, the entrypoint is:
ENTRYPOINT ["/usr/local/bin/nvidia-driver", "init"]

I would really appreciate any help or advice on how to solve this issue.
Thanks in advance!

Have a nice week,
Antonio.

Did you find a solution?

Did you run this code in the Dockerfile? Did it work?

I’ve tried signing in the Dockerfile as shown in the code above, and I also tried using donkey, but both approaches end with an SELinux error about trying to load an unsigned module.

According to the donkey docs, the container must run with host networking. That may be the cause of the “could not establish connection” error, or it may be that no “donkey set” command was run beforehand.

Anyway, as stated in the GitHub project, the purpose of the tool is to “protect sensitive information”, so I am going to try to bypass donkey and just use the sign-file script.

You can simply generate a key in the dockerfile and then build it.
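
For anyone following along, here is a minimal sketch of what generating such a key pair could look like. It assumes openssl is available in the build image; the helper name generate_signing_key, the file names, the 10-year validity and the CN are placeholders I made up, not something taken from the NVIDIA driver container scripts.

```shell
# Hypothetical helper: create a self-signed key pair for module signing.
# Writes <dir>/signing.key (PEM private key, unencrypted) and
# <dir>/signing.x509 (DER-encoded certificate).
generate_signing_key() {
    out_dir=$1
    common_name=$2
    mkdir -p "$out_dir" || return 1
    # -nodes leaves the private key unencrypted so the signing tool can
    # read it without a passphrase prompt.
    openssl req -new -x509 -newkey rsa:2048 -nodes \
        -days 3650 \
        -subj "/CN=${common_name}/" \
        -keyout "${out_dir}/signing.key" \
        -outform DER -out "${out_dir}/signing.x509" 2>/dev/null
}
```

Usage would be something like: generate_signing_key /drivers/kernel "My module signing key". Keep in mind that a private key written during the image build stays in an image layer, which is the exposure donkey was meant to avoid, and that the kernel only accepts the signature if it trusts the matching certificate.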

After that, in the runtime container, use the same arguments you mentioned: pass them to the nvidia-installer command so it loads the keys you generated earlier. It worked for me.

However, my problem was related to a bug in SELinux on fcos-34, so I had to create a policy to work around it.

Hi!
Thanks for the answer! I had lost hope that anyone could help me with this.

I removed the donkey step too, so I can pass the keys directly to the script. I also replaced the sign-file binary with kmodsign and added a step that signs nvidia.ko directly:

    if [ -n "${PRIVATE_KEY}" ]; then
        echo "Signing NVIDIA driver kernel modules..."
        sh -c "PATH=${PATH}:/usr/src/linux-headers-${KERNEL_VERSION}/scripts && \
          kmodsign sha512 ${PRIVATE_KEY} /drivers/kernel/pubkey.x509 nvidia.ko && \
          kmodsign sha512 ${PRIVATE_KEY} /drivers/kernel/pubkey.x509 nvidia.ko nvidia.ko.sign && \
          kmodsign sha512 ${PRIVATE_KEY} /drivers/kernel/pubkey.x509 nvidia-modeset.ko nvidia-modeset.ko.sign && \
          kmodsign sha512 ${PRIVATE_KEY} /drivers/kernel/pubkey.x509 nvidia-uvm.ko"
        ls -l
        nvidia_sign_args="--linked-module nvidia.ko --signed-module nvidia.ko.sign"
        nvidia_modeset_sign_args="--linked-module nvidia-modeset.ko --signed-module nvidia-modeset.ko.sign"
        nvidia_uvm_sign_args="--signed"
        
        echo "modinfo -F signer nvidia.ko"
        modinfo -F signer nvidia.ko
        echo "modinfo -F signer nvidia-uvm.ko"
        modinfo -F signer nvidia-uvm.ko
    fi

Afterwards, once the container is running, the logs show that the modules were signed:

Relinking NVIDIA driver kernel modules...
ld: warning: ./nvidia/nv-kernel.o_binary: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
Signing NVIDIA driver kernel modules...
total 359276
-rw-r--r-- 1 root root    10949 Feb 18 00:02 Kbuild
-rw-r--r-- 1 root root     4801 Feb 18 00:02 Makefile
-rw-r--r-- 1 root root     9002 Jul  5 11:20 Module.symvers
drwxr-xr-x 3 root root     4096 Feb 18 00:04 common
drwxr-xr-x 4 root root     4096 Jul  5 11:19 conftest
-rwxr-xr-x 1 root root   251820 Feb 17 22:33 conftest.sh
-rw-r--r-- 1 root root      922 Feb 17 22:33 count-lines.mk
-rw-r--r-- 1 root root      239 Jul  5 11:20 modules.order
-rw-r--r-- 1 root root 12714128 Jul  5 11:20 nv-linux.o
-rw-r--r-- 1 root root   858720 Jul  5 11:20 nv-modeset-linux.o
-rw-r--r-- 1 root root       68 Jul  5 11:19 nv_compiler.h
drwxr-xr-x 5 root root    12288 Jul  5 11:20 nvidia
drwxr-xr-x 2 root root     4096 Jul  5 11:20 nvidia-drm
-rw-r--r-- 1 root root  4159744 Jul  5 11:20 nvidia-drm.ko
-rw-r--r-- 1 root root     1107 Jul  5 11:19 nvidia-drm.mod
-rw-r--r-- 1 root root    13755 Jul  5 11:20 nvidia-drm.mod.c
-rw-r--r-- 1 root root   157056 Jul  5 11:20 nvidia-drm.mod.o
-rw-r--r-- 1 root root  4005264 Jul  5 11:20 nvidia-drm.o
drwxr-xr-x 2 root root     4096 Jul  5 11:20 nvidia-modeset
-rw-r--r-- 1 root root  2499600 Jul  5 11:20 nvidia-modeset.ko
-rw-r--r-- 1 root root  2500045 Jul  5 11:20 nvidia-modeset.ko.sign
-rw-r--r-- 1 root root      205 Jul  5 11:19 nvidia-modeset.mod
-rw-r--r-- 1 root root     6740 Jul  5 11:20 nvidia-modeset.mod.c
-rw-r--r-- 1 root root   153600 Jul  5 11:20 nvidia-modeset.mod.o
-rw-r--r-- 1 root root  2348936 Jul  5 11:20 nvidia-modeset.o
drwxr-xr-x 2 root root     4096 Jul  5 11:20 nvidia-peermem
-rw-r--r-- 1 root root   389808 Jul  5 11:20 nvidia-peermem.ko
-rw-r--r-- 1 root root       66 Jul  5 11:19 nvidia-peermem.mod
-rw-r--r-- 1 root root     1133 Jul  5 11:20 nvidia-peermem.mod.c
-rw-r--r-- 1 root root   150544 Jul  5 11:20 nvidia-peermem.mod.o
-rw-r--r-- 1 root root   241008 Jul  5 11:20 nvidia-peermem.o
drwxr-xr-x 3 root root    20480 Jul  5 11:20 nvidia-uvm
-rw-r--r-- 1 root root 53730757 Jul  5 11:20 nvidia-uvm.ko
-rw-r--r-- 1 root root     7559 Jul  5 11:19 nvidia-uvm.mod
-rw-r--r-- 1 root root    17723 Jul  5 11:20 nvidia-uvm.mod.c
-rw-r--r-- 1 root root   158232 Jul  5 11:20 nvidia-uvm.mod.o
-rw-r--r-- 1 root root 53574720 Jul  5 11:20 nvidia-uvm.o
-rw-r--r-- 1 root root 76571533 Jul  5 11:20 nvidia.ko
-rw-r--r-- 1 root root 76571978 Jul  5 11:20 nvidia.ko.sign
-rw-r--r-- 1 root root     2609 Jul  5 11:19 nvidia.mod
-rw-r--r-- 1 root root    29023 Jul  5 11:20 nvidia.mod.c
-rw-r--r-- 1 root root   220656 Jul  5 11:20 nvidia.mod.o
-rw-r--r-- 1 root root 76396528 Jul  5 11:20 nvidia.o
modinfo -F signer nvidia.ko
<CN>
modinfo -F signer nvidia-uvm.ko
<CN>
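
As a cross-check on the modinfo output above, here is a small sketch that tests whether a .ko file carries an appended signature at all. It relies on the fact that a signed, uncompressed module ends with the 28-byte marker "~Module signature appended~" plus a newline. The function name is_module_signed is hypothetical, and the check only shows that a signature is present, not that the kernel trusts the key.

```shell
# Hypothetical helper: succeed if the file ends with the kernel's
# appended-signature marker, fail otherwise.
is_module_signed() {
    # The marker "~Module signature appended~\n" is exactly 28 bytes long
    # and is the very last thing in a signed, uncompressed module file.
    tail -c 28 "$1" | grep -q '~Module signature appended~'
}
```

For example, is_module_signed nvidia.ko && echo signed could be run right before the installer loads the module, to confirm that the file being loaded is the signed copy. Compressed modules (for example .ko.zst) would need to be decompressed first.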

But I get this error during the build, when the NVIDIA installer tries to load the modules:

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 20 CPUs online; setting concurrency level to 20.
Installing NVIDIA driver version 535.161.07.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/6.8.0-35-generic/build'

Kernel output path: '/lib/modules/6.8.0-35-generic/build'

Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules
  : [##############################] 100%

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.

ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Kernel module compilation complete.
Unable to determine if Secure Boot is enabled: No such file or directory
Kernel module load error: Key was rejected by service
Kernel messages:
[15237.195909] docker0: port 1(veth9b46407) entered blocking state
[15237.195916] docker0: port 1(veth9b46407) entered disabled state
[15237.195934] veth9b46407: entered allmulticast mode
[15237.195994] veth9b46407: entered promiscuous mode
[15237.196494] docker0: port 2(veth1b83b48) entered disabled state
[15237.197551] veth1b83b48 (unregistering): left allmulticast mode
[15237.197553] veth1b83b48 (unregistering): left promiscuous mode
[15237.197560] docker0: port 2(veth1b83b48) entered disabled state
[15237.402930] eth0: renamed from vetha1b1b5d
[15237.411020] docker0: port 1(veth9b46407) entered blocking state
[15237.411030] docker0: port 1(veth9b46407) entered forwarding state
[15237.452814] vetha1b1b5d: renamed from eth0
[15237.466296] docker0: port 1(veth9b46407) entered disabled state
[15237.483747] docker0: port 1(veth9b46407) entered disabled state
[15237.484038] veth9b46407 (unregistering): left allmulticast mode
[15237.484041] veth9b46407 (unregistering): left promiscuous mode
[15237.484049] docker0: port 1(veth9b46407) entered disabled state
[15347.929599] VFIO - User Level meta-driver version: 0.3
[15348.041634] Loading of unsigned module is rejected
[15413.411620] VFIO - User Level meta-driver version: 0.3
[15413.516083] Loading of unsigned module is rejected
[15472.262948] VFIO - User Level meta-driver version: 0.3
[15472.366325] Loading of unsigned module is rejected
[15593.975027] VFIO - User Level meta-driver version: 0.3
[15594.083684] Loading of unsigned module is rejected
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
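
A note on reading the kernel messages above. As far as I know, when module signature enforcement is active the kernel logs "Loading of unsigned module is rejected" for a module with no signature at all, and a different message when a signature is present but the key is not available or the crypto is unsupported. In each case the load itself fails with "Key was rejected by service". The sketch below sorts a log into those cases; the helper name and the labels it prints are made up, and the exact message strings may differ between kernel versions.

```shell
# Hypothetical helper: read kernel messages on stdin and print a short
# label for the signature-related reason a module load was rejected.
classify_module_reject() {
    msgs=$(cat)
    case $msgs in
        *"Loading of unsigned module is rejected"*)
            # The file handed to the kernel had no signature appended.
            echo "unsigned" ;;
        *"Loading of module with unavailable key is rejected"*)
            # Signed, but the signing key is not in a trusted keyring.
            echo "untrusted-key" ;;
        *"Loading of module with unsupported crypto is rejected"*)
            echo "unsupported-crypto" ;;
        *)
            echo "unknown" ;;
    esac
}
```

It can be fed from dmesg, for example: dmesg | classify_module_reject. For the log above this would report unsigned, which suggests that the module the installer tried to load was not the signed copy.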

In addition, I also added the signing options to the nvidia-installer call in the script:

init() {
    if [ "${DRIVER_TYPE}" = "vgpu" ]; then
        _find_vgpu_driver_version || exit 1
    fi

    # Install the userspace components and copy the kernel module sources.
    sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x \
        --module-signing-secret-key="${PRIVATE_KEY}" \
        --module-signing-public-key=/drivers/kernel/pubkey.x509 && \
        cd NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION && \
        ./nvidia-installer --silent \
                    --module-signing-secret-key="${PRIVATE_KEY}" \
                    --module-signing-public-key=/drivers/kernel/pubkey.x509 \
                    --no-dkms \
                    --force-selinux=no \
                    --disable-nouveau \
                    --no-kernel-module \
                    --no-nvidia-modprobe \
                    --no-rpms \
                    --no-backup \
                    --no-check-for-alternate-installs \
                    --no-libglx-indirect \
                    --no-install-libglvnd \
                    --x-prefix=/tmp/null \
                    --x-module-path=/tmp/null \
                    --x-library-path=/tmp/null \
                    --x-sysconfig-path=/tmp/null && \
        mkdir -p /usr/src/nvidia-${DRIVER_VERSION} && \
        mv LICENSE mkprecompiled ${KERNEL_TYPE} /usr/src/nvidia-${DRIVER_VERSION} && \
        sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest > /usr/src/nvidia-${DRIVER_VERSION}/.manifest

    echo -e "\n========== NVIDIA Software Installer ==========\n"
    echo -e "Starting installation of NVIDIA driver version ${DRIVER_VERSION} for Linux kernel version ${KERNEL_VERSION}\n"

Were you able to solve the issue?

[Solved]
The key was passing the signing arguments to nvidia-installer in the _install_driver() function:

_install_driver() {
    local install_args=()

    echo "Installing NVIDIA driver kernel modules..."
    cd /usr/src/nvidia-${DRIVER_VERSION}
    if [ -d /lib/modules/${KERNEL_VERSION}/kernel/drivers/video ]; then
        rm -rf /lib/modules/${KERNEL_VERSION}/kernel/drivers/video
    else
        rm -rf /lib/modules/${KERNEL_VERSION}/video
    fi

    if [ "${ACCEPT_LICENSE}" = "yes" ]; then
        install_args+=("--accept-license")
    fi

    nvidia-installer --module-signing-secret-key="${PRIVATE_KEY}" \
                     --module-signing-public-key=/drivers/kernel/pubkey.x509 \
                     --kernel-module-only --no-drm --ui=none --no-nouveau-check -m=${KERNEL_TYPE} ${install_args[@]+"${install_args[@]}"}
}

By the way, I have ACCEPT_LICENSE set to an empty string (ACCEPT_LICENSE="").
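
If the same script also has to work without signing, the two flags could be added only when a key is configured, mirroring the [ -n "${PRIVATE_KEY}" ] guard used in the signing step earlier in the thread. This is just a sketch: signing_args is a hypothetical helper, and only the two --module-signing-* options come from the function above.

```shell
# Hypothetical helper: print the nvidia-installer signing options, one per
# line, or nothing when either path is empty.
# $1 = private key path, $2 = public certificate path.
signing_args() {
    if [ -n "$1" ] && [ -n "$2" ]; then
        printf '%s\n' "--module-signing-secret-key=$1" \
                      "--module-signing-public-key=$2"
    fi
}
```

For example, signing_args "${PRIVATE_KEY}" /drivers/kernel/pubkey.x509 prints both options when PRIVATE_KEY is set and nothing when it is empty, so an unsigned build would go through the same code path. Since the output is one option per line, paths containing newlines are not supported.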
