Installing the GPU operator on SELinux-enforcing nodes

diana.gaponcic · June 12, 2023, 9:50am

Hi,

I have an issue installing the GPU operator on nodes with SELinux in enforcing mode.

SELinux is enabled on the nodes, but disabled in containerd. As a result, the domains are not propagated to the containers. If we check the drivers’ directory on the node, the video folder is of type modules_object_t (the type needed by kernel modules):

$ ls -Z /usr/lib/modules/5.19.9-200.fc36.x86_64/kernel/drivers
...
system_u:object_r:modules_object_t:s0 video
...

In the pod (nvidia-driver-daemonset), however, the same folder has type var_lib_t:

$ ls -Z /usr/lib/modules/5.19.9-200.fc36.x86_64/kernel/drivers/
...
system_u:object_r:var_lib_t:s0 video
...

When trying to install the gpu-operator, the installation fails, and the error on the node is:

[ 5039.462792] audit: type=1400 audit(1684925801.740:1438): avc:  denied  { module_load } for  pid=189114 comm="modprobe" path="/usr/lib/modules/5.19.9-200.fc36.x86_64/kernel/drivers/video/nvidia.ko" dev="overlay" ino=95506836 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:object_r:var_lib_t:s0 tclass=system permissive=0

The error tells me that unconfined_service_t is not allowed to perform module_load on files of type var_lib_t. This is expected, because a file needs the type modules_object_t to be loadable as a kernel module. But because containerd does not propagate the types into the pods, the drivers lose the modules_object_t type and cannot be loaded.

One easy fix is to add a policy rule: allow unconfined_service_t var_lib_t:system module_load. This is better than running SELinux in permissive mode, but it still grants more permissions than required, so I was wondering whether there is another way to solve this. Enabling SELinux in containerd is not an option.
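For reference, the rule can be read straight off the denial: the types in scontext and tcontext, the tclass, and the permission in braces map one-to-one onto an allow statement. Below is a minimal sketch of that mapping, assuming a single permission per denial line; the function name avc_to_allow is made up for illustration, and audit2allow does the same job more thoroughly.

```shell
#!/bin/sh
# Sketch only: turn one AVC denial line into the matching allow rule.
# avc_to_allow is a hypothetical helper name, not part of any SELinux tool.
avc_to_allow() {
    line=$1
    # The permission sits between the braces: "denied  { module_load }".
    perm=$(printf '%s\n' "$line" | sed -n 's/.*{ *\([a-z_]*\) *}.*/\1/p')
    # The type is the third colon-separated field of each context.
    src=$(printf '%s\n' "$line" | sed -n 's/.*scontext=[^:]*:[^:]*:\([^:]*\):.*/\1/p')
    tgt=$(printf '%s\n' "$line" | sed -n 's/.*tcontext=[^:]*:[^:]*:\([^:]*\):.*/\1/p')
    # The object class follows "tclass=" up to the next space.
    cls=$(printf '%s\n' "$line" | sed -n 's/.*tclass=\([^ ]*\).*/\1/p')
    printf 'allow %s %s:%s %s;\n' "$src" "$tgt" "$cls" "$perm"
}
```

Called with the denial quoted above as its only argument, this should print the same rule. The rule would still have to be wrapped in a policy module with a matching require block and loaded, for example with checkmodule, semodule_package and semodule.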

If I run chcon -t modules_object_t /usr/lib/modules/5.19.9-200.fc36.x86_64/kernel/drivers/video before nvidia-driver init, the installation succeeds, but there is no easy way to do this through the values file.
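As a rough sketch, the manual relabel could be wrapped in a small guard that only calls chcon when the directory does not already have the right type. The function names selinux_type and ensure_modules_label are hypothetical, and where such a script would be hooked in before nvidia-driver init is left open, since that is exactly the part the values file does not cover. It assumes GNU stat, whose %C format prints the SELinux context.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.

# Extract the SELinux type (third field) from a context string such as
# system_u:object_r:var_lib_t:s0.
selinux_type() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Relabel a directory to modules_object_t unless it already has that type.
ensure_modules_label() {
    dir=$1
    # GNU stat prints the SELinux security context with %C.
    ctx=$(stat -c %C "$dir") || return 1
    if [ "$(selinux_type "$ctx")" != "modules_object_t" ]; then
        chcon -t modules_object_t "$dir"
    fi
}

# Example call, using the directory from this post (needs root):
# ensure_modules_label /usr/lib/modules/5.19.9-200.fc36.x86_64/kernel/drivers/video
```

The check keeps the step idempotent, so running it again on an already relabeled directory does nothing.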

Versions used:

Kubernetes v1.25
Fedora CoreOS 36
containerd 1.7.1
gpu-operator 22.9.1

Any ideas would be highly appreciated.

Best regards, Diana.

vkhomyakov · June 20, 2023, 10:42am

Hello Diana,

I tried to find a solution for your scenario, but unfortunately had no luck. The problem looks more complex than expected and needs to be reproduced first.

If you have an active Enterprise Support contract, please open a case and we’ll do our best to reproduce the issue and engage experts to look for a fix or workaround for the SELinux problem.

Regards,
Vladislav

gernot.seidler · June 21, 2023, 10:37pm

We have encountered the same issue.
OS: RHEL 8.8
Kubernetes: v1.24.9
containerd: 1.6.15
gpu-operator: v23.3.2
