Hi all,
I’m working on a single-node Kubernetes cluster based on Ubuntu 23 with RKE2, and I want to use signed NVIDIA drivers via the GPU Operator Helm chart.
The driver we are currently using is 535.129.03.
In the past, with Ubuntu 22, we were using this image:
I’ve tried to sign as shown in the above code, in the Dockerfile, and I also tried using donkey, but both approaches throw an SELinux error for trying to load an unsigned module.
As per the donkey docs, the container must run with host networking; maybe that is the cause of the “cannot establish connection” error, or perhaps it is because there is no prior “donkey set” command.
Anyway, as stated in the GitHub project, the purpose of the tool is to “protect sensitive information”, so I am going to try to bypass donkey and just use the sign-file script.
You can simply generate a key in the Dockerfile and then build it.
After that, in the runtime container, use the same arguments you mentioned: pass them to the nvidia-installer command so it picks up the keys you generated earlier. It worked for me.
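For reference, a minimal sketch of that key-generation step in the Dockerfile (the file names and CN are illustrative, not from the thread): it produces a private key plus a self-signed X.509 certificate in DER form, the format the installer's --module-signing-* flags expect.

```shell
# Generate a throwaway module-signing key pair at image build time:
# private key (PEM) and matching self-signed certificate (DER).
openssl req -new -x509 -newkey rsa:2048 -nodes -days 36500 \
    -subj "/CN=nvidia-module-signing/" \
    -keyout private.key -outform DER -out pubkey.x509
```

The public certificate then gets copied to wherever the runtime container expects it (e.g. /drivers/kernel/pubkey.x509 in the install script).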
However, my problem turned out to be an SELinux bug on FCOS 34, so I had to create a policy to work around it.
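For what it's worth, the usual way to generate such a workaround policy is audit2allow against the logged AVC denials. This is only a sketch: the module name nvidiasign is made up, and it no-ops where the audit tooling isn't installed.

```shell
# Build a local SELinux policy module from recent AVC denials and load it.
# Requires the audit and policycoreutils packages; guarded so it skips
# cleanly on systems without them.
if command -v ausearch >/dev/null && command -v audit2allow >/dev/null; then
    ausearch -m avc -ts recent | audit2allow -M nvidiasign
    semodule -i nvidiasign.pp
fi
```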
Hi!
Thanks for the answer! I had lost hope that anyone could help me with this.
I removed the donkey step too, so I can pass the keys directly to the script. I also replaced the sign-file binary with kmodsign and added nvidia.ko so it gets signed directly:
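That kmodsign step looks roughly like this (key paths are illustrative; kmodsign appends the signature to the module in place):

```shell
# Sign the freshly built module; sha512 matches the hash Ubuntu's kernel
# tooling uses by default. Guarded in case the module isn't present.
if [ -e nvidia.ko ]; then
    kmodsign sha512 /drivers/kernel/private.key /drivers/kernel/pubkey.x509 nvidia.ko
fi

# Quick sanity check: every signed .ko ends with this 28-byte literal marker.
if [ -e nvidia.ko ] && tail -c 28 nvidia.ko | grep -q '~Module signature appended~'; then
    echo "nvidia.ko: signature appended"
fi
```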
But I get this error during the build process, when the NVIDIA installer tries to load the modules:
Welcome to the NVIDIA Software Installer for Unix/Linux
Detected 20 CPUs online; setting concurrency level to 20.
Installing NVIDIA driver version 535.161.07.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/6.8.0-35-generic/build'
Kernel output path: '/lib/modules/6.8.0-35-generic/build'
Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules
: [##############################] 100%
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Kernel module compilation complete.
Unable to determine if Secure Boot is enabled: No such file or directory
Kernel module load error: Key was rejected by service
Kernel messages:
[15237.195909] docker0: port 1(veth9b46407) entered blocking state
[15237.195916] docker0: port 1(veth9b46407) entered disabled state
[15237.195934] veth9b46407: entered allmulticast mode
[15237.195994] veth9b46407: entered promiscuous mode
[15237.196494] docker0: port 2(veth1b83b48) entered disabled state
[15237.197551] veth1b83b48 (unregistering): left allmulticast mode
[15237.197553] veth1b83b48 (unregistering): left promiscuous mode
[15237.197560] docker0: port 2(veth1b83b48) entered disabled state
[15237.402930] eth0: renamed from vetha1b1b5d
[15237.411020] docker0: port 1(veth9b46407) entered blocking state
[15237.411030] docker0: port 1(veth9b46407) entered forwarding state
[15237.452814] vetha1b1b5d: renamed from eth0
[15237.466296] docker0: port 1(veth9b46407) entered disabled state
[15237.483747] docker0: port 1(veth9b46407) entered disabled state
[15237.484038] veth9b46407 (unregistering): left allmulticast mode
[15237.484041] veth9b46407 (unregistering): left promiscuous mode
[15237.484049] docker0: port 1(veth9b46407) entered disabled state
[15347.929599] VFIO - User Level meta-driver version: 0.3
[15348.041634] Loading of unsigned module is rejected
[15413.411620] VFIO - User Level meta-driver version: 0.3
[15413.516083] Loading of unsigned module is rejected
[15472.262948] VFIO - User Level meta-driver version: 0.3
[15472.366325] Loading of unsigned module is rejected
[15593.975027] VFIO - User Level meta-driver version: 0.3
[15594.083684] Loading of unsigned module is rejected
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
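The “Key was rejected by service” / “Loading of unsigned module is rejected” lines mean the kernel does not trust the signing key: with Secure Boot enabled, the public key must be enrolled in the kernel's trusted keyring (on Ubuntu, as a Machine Owner Key). A hedged set of checks, meant to be run on the host rather than in the container (the pubkey.x509 path follows the install script):

```shell
# Check Secure Boot state and whether the signing cert is already trusted.
# mokutil usually isn't present inside the driver container, hence the guard.
if command -v mokutil >/dev/null; then
    mokutil --sb-state || true                              # enabled/disabled?
    mokutil --test-key /drivers/kernel/pubkey.x509 || true  # enrolled or not?
    # If the key is not enrolled, stage it and reboot through the MOK manager:
    # mokutil --import /drivers/kernel/pubkey.x509
else
    echo "mokutil not found; run these checks on the host"
fi
```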
In addition, I added these options to the nvidia-installer script:
init() {
    if [ "${DRIVER_TYPE}" = "vgpu" ]; then
        _find_vgpu_driver_version || exit 1
    fi

    # Install the userspace components and copy the kernel module sources.
    sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x \
        --module-signing-secret-key="${PRIVATE_KEY}" \
        --module-signing-public-key=/drivers/kernel/pubkey.x509 && \
    cd NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION && \
    ./nvidia-installer --silent \
        --module-signing-secret-key="${PRIVATE_KEY}" \
        --module-signing-public-key=/drivers/kernel/pubkey.x509 \
        --no-dkms \
        --force-selinux=no \
        --disable-nouveau \
        --no-kernel-module \
        --no-nvidia-modprobe \
        --no-rpms \
        --no-backup \
        --no-check-for-alternate-installs \
        --no-libglx-indirect \
        --no-install-libglvnd \
        --x-prefix=/tmp/null \
        --x-module-path=/tmp/null \
        --x-library-path=/tmp/null \
        --x-sysconfig-path=/tmp/null && \
    mkdir -p /usr/src/nvidia-${DRIVER_VERSION} && \
    mv LICENSE mkprecompiled ${KERNEL_TYPE} /usr/src/nvidia-${DRIVER_VERSION} && \
    sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest > /usr/src/nvidia-${DRIVER_VERSION}/.manifest

    echo -e "\n========== NVIDIA Software Installer ==========\n"
    echo -e "Starting installation of NVIDIA driver version ${DRIVER_VERSION} for Linux kernel version ${KERNEL_VERSION}\n"
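As a fallback, instead of relying on the installer's --module-signing-* flags, the extracted module can be signed afterwards with the sign-file script shipped in the kernel headers (the kernel version follows the log above; paths and the relative module location are illustrative):

```shell
# Sign nvidia.ko directly with the kernel's own sign-file helper.
SIGN_FILE=/lib/modules/6.8.0-35-generic/build/scripts/sign-file
if [ -x "$SIGN_FILE" ]; then
    "$SIGN_FILE" sha512 "${PRIVATE_KEY}" /drivers/kernel/pubkey.x509 kernel/nvidia.ko
fi
```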