We have about 50 servers that are configured to run an AI video platform. They are all configured the same and working, except our lab machine, which I recently updated, and I am running into issues with it stopping working after a while.
The machine is running RHEL 8.8
NVIDIA driver v535
CUDA Toolkit 12.2
Podman 4.4
The apps on the system are container-based and run in Podman. For the first issue we were having, I pulled a bug report and there were a bunch of SELinux errors in /var/log/messages. I ran the command listed in the error and rebooted the machine, which fixed it long enough for the vendor to install everything and get it working. But now, today, the issue is back, though I don't see any SELinux issues in this bug report. I am new to using NVIDIA drivers for enterprise use and for containers, so I am hoping I can get some insight into what is going on. Attached is the bug report.
nvidia-bug-report.log.gz (329.1 KB)
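In case it helps, this is roughly how I have been checking for new denials on the lab box (just the stock RHEL audit/SELinux tools, nothing custom, and the time windows are just what I happened to use):

getenforce                                     # confirm SELinux is still enforcing
ausearch -m avc -ts recent                     # any fresh AVC denials?
ausearch -m avc -ts today --raw | audit2allow  # print suggested rules without installing anything

These came back empty today, which is why I'm not sure SELinux is still the problem.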
Thank you for your time
Edit: The commands I ran before for SELinux were
ausearch -c 'nvidia-smi' --raw | audit2allow -M my-nvidiasmi
semodule -i my-nvidiasmi.pp
But it's curious that the rest of the machines are working fine and don't have this module; I really have no idea what is going on with this particular machine. We're using it as the basis for updating all of our production machines, but not until this one actually works properly.
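My next step is to diff this box against one of the known-good production servers with something like the commands below (standard RHEL/SELinux tooling; my-nvidiasmi is the module I generated above, and container_use_devices is just a container-selinux boolean I'm guessing might be relevant, so treat that part as an assumption):

semodule -l | grep -i nvidia            # is my custom my-nvidiasmi module the only SELinux difference?
getsebool container_use_devices         # boolean I'm assuming could matter for GPU device access
rpm -qa 'nvidia*' '*container*' | sort  # compare driver/toolkit/container package versions
sestatus                                # mode, policy type, policy version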