I have the following setup:
- Host runs Ubuntu 22.04
- vGPU drivers 535.129.03 installed
- Created a KVM VM with a vGPU attached - the VM runs a single-node OpenShift 4.14.6
To set up the NVIDIA GPU Operator on OpenShift I followed:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html
→ The guide uses the “nvidia / container-images / driver · GitLab” repository.
→ “Change directory to the operating system name and version under the driver directory”
→ " For Red Hat OpenShift Container Platform, use a directory that includes rhel
in the directory name."
I used rhel8 - created an image and pushed it to a registry
Status:
The GPU Operator looks for an image whose tag ends in “rhcos4.14”.
The image I built, tagged accordingly, is pulled, but the build process in the openshift-driver-toolkit-ctr container ends with the following error: Unable to load the kernel module ‘nvidia.ko’ (full message shown at the bottom [1])
Checks on the OpenShift node:
- Nouveau:
lsmod | grep nouveau
→ no output
- Graphics card available:
lspci | grep -e VGA -ie NVIDIA
00:01.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01)
05:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
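To make this check repeatable, here is a minimal Python sketch that filters lspci output for NVIDIA devices. The function name nvidia_devices and the constant names are hypothetical; the sample text is the lspci output shown above. It only confirms that the device is visible on the PCI bus, not that the driver supports it.

```python
import re

# One lspci line looks like: "<slot> <device class>: <description>"
LSPCI_LINE = re.compile(r"^(?P<slot>\S+) (?P<dev_class>[^:]+): (?P<description>.+)$")


def nvidia_devices(lspci_text):
    """Return (slot, device class, description) for each NVIDIA device listed."""
    devices = []
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        # Keep only entries whose description mentions NVIDIA (case-insensitive).
        if match and "nvidia" in match.group("description").lower():
            devices.append(
                (match.group("slot"), match.group("dev_class"), match.group("description"))
            )
    return devices


# Sample copied from the lspci output in this post.
LSPCI_SAMPLE = """\
00:01.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01)
05:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
"""

if __name__ == "__main__":
    for slot, dev_class, description in nvidia_devices(LSPCI_SAMPLE):
        print(f"{slot}: {description} ({dev_class})")
```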
I am not sure where the error is… Did I use the wrong operating system to build the container? Is the vGPU wrongly attached to the machine?
Any assistance is highly appreciated. Thank you in advance.
[1]
ERROR: Unable to load the kernel module ‘nvidia.ko’. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries ‘Kernel module load error’ and ‘Kernel messages’ at the end of the file ‘/var/log/nvidia-installer.log’ for more information.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Kernel module compilation complete.
Kernel module load error: No such device
Kernel messages:
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 535.129.03 driver release.
NVRM: Please see ‘Appendix A - Supported NVIDIA GPU Products’
NVRM: in this release’s README, available on the operating system
NVRM: specific graphics driver download page at www.nvidia.com.
[ 557.289879] nvidia: probe of 0000:05:00.0 failed with error -1
[ 557.289966] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 557.289968] NVRM: None of the NVIDIA devices were initialized.
[ 557.290382] nvidia-nvlink: Unregistered Nvlink Core, major device number 511
[ 642.973032] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 642.973108] IPv6: ADDRCONF(NETDEV_CHANGE): 601fa6e42f497c6: link becomes ready
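
For triaging the kernel messages above, a small Python sketch like the following extracts the PCI address and error code from each failed nvidia probe line. The function name find_failed_probes and the constant names are hypothetical, and the sample is taken from the log excerpt in [1]; it only reports which device failed to probe, not why.

```python
import re

# Matches kernel log lines such as:
#   [ 557.289879] nvidia: probe of 0000:05:00.0 failed with error -1
PROBE_FAILURE = re.compile(
    r"nvidia: probe of (?P<address>[0-9a-fA-F:.]+) failed with error (?P<code>-?\d+)"
)


def find_failed_probes(log_text):
    """Return (pci_address, error_code) for every failed nvidia probe line."""
    return [
        (match.group("address"), int(match.group("code")))
        for match in PROBE_FAILURE.finditer(log_text)
    ]


# Sample copied from the kernel messages in this post.
DMESG_SAMPLE = """\
[ 557.289879] nvidia: probe of 0000:05:00.0 failed with error -1
[ 557.289966] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 557.289968] NVRM: None of the NVIDIA devices were initialized.
"""

if __name__ == "__main__":
    for address, code in find_failed_probes(DMESG_SAMPLE):
        print(f"probe failed for {address} (error {code})")
```

The address it reports here, 0000:05:00.0, is the same slot that lspci lists for the RTX A5000 on the node.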